Abstract

The disagreement coefficient of Hanneke has become a central data independent invariant
in proving active learning rates. It has been shown in various ways that a concept class 
with low complexity together with a bound on the disagreement coefficient at an optimal 
solution allows active learning rates that are superior to passive learning ones. 

We present a different tool for pool based active learning which follows from the exis- 
tence of a certain uniform version of low disagreement coefficient, but is not equivalent to 
it. In fact, we present two fundamental active learning problems of significant interest for 
which our approach allows nontrivial active learning bounds. However, any general purpose 
method relying on the disagreement coefficient bounds only fails to guarantee any useful 
bounds for these problems. The applications of interest are: learning to rank from pairwise 
preferences, and clustering with side information (a.k.a. semi-supervised clustering). 

The tool we use is based on the learner's ability to compute an estimator of the difference
between the loss of any hypothesis and some fixed "pivotal" hypothesis to within an
absolute error of at most $\varepsilon$ times the disagreement measure ($\ell_1$ distance) between the two
hypotheses. We prove that such an estimator implies the existence of a learning algorithm
which, at each iteration, reduces its in-class excess risk to within a constant factor. Each
iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss
difference function with respect to the previous pivotal hypothesis. The label complexity
essentially becomes that of computing this estimator.

Keywords: active learning, learning to rank from pairwise preferences, semi-supervised
clustering, clustering with side information, disagreement coefficient, smooth relative regret
approximation

1. Introduction



Unlike in standard PAC learning, an active learner chooses which instances to learn from.
In the streaming setting, the learner may reject labels for instances arriving in a stream, and in
the pool setting it may collect a pool of instances and then choose a subset for which to
request labels. Although a relatively young field compared to traditional (passive) learning,
there is by now a significant body of literature on the subject (see, e.g., Freund et al., 1997;
Dasgupta, 2005; Castro et al., 2005; Kääriäinen, 2006; Balcan et al., 2006; Sugiyama, 2006;
Hanneke, 2007; Balcan et al., 2007; Dasgupta et al., 2007; Bach, 2007; Castro and Nowak,
2008; Balcan et al., 2008; Dasgupta and Hsu, 2008; Cavallanti et al., 2008; Hanneke, 2009;
Beygelzimer et al., 2009, 2010; Koltchinskii, 2010; Cesa-Bianchi et al., 2010; Yang et al., 2010;
Hanneke and Yang, 2010; El-Yaniv and Wiener, 2010; Hanneke, 2011; Cavallanti et al., 2011;
Yang et al., 2011; Wang, 2011; Minsker, 2012; Orabona and Cesa-Bianchi, 2011). Refer to a
survey by Settles (2009) for a definition of active learning.



The disagreement coefficient of Hanneke (2007) has become a central data independent
invariant in proving active learning rates. It has been shown in various ways that a concept
class with low complexity together with a bound on the disagreement coefficient at
an optimal solution allows active learning rates that are superior to passive rates under
certain low noise conditions (see, e.g., Hanneke, 2007; Balcan et al., 2007; Dasgupta et al.,
2007; Castro and Nowak, 2008; Beygelzimer et al., 2010). The best results assuming only
bounded VC dimension $d$ and disagreement coefficient $\theta$ can roughly be stated as follows:
If the sought (in-class) excess risk $\mu$ is the same order of magnitude as the optimal error
$\nu$ or larger, then the number of required queries is roughly $\tilde{O}(\theta d \log(1/\mu))$.¹ Otherwise,
the number is roughly $\tilde{O}(\theta d \nu^2/\mu^2)$. Note that this result makes no assumption on the
noise (except maybe for its magnitude). Better results can be obtained by assuming certain
statistical properties of the noise (especially the model of Mammen and Tsybakov, 1999;
Tsybakov, 2004).

The idea behind the disagreement coefficient is intuitive and simple. If a hypothesis $h$ is
$r$-close to optimal, then the difference between their losses (the regret of $h$) can be computed
from instances in the disagreement region only, defined as the set of instances on which the
$r$-ball around the optimal hypothesis is not unanimous. This means that for minimizing regret,
one may restrict attention to hypotheses lying in iteratively shrinking version spaces and to
instances in the corresponding disagreement regions, which are shrinking in tandem with the
version spaces if the disagreement coefficient is small. As pointed out in Beygelzimer et al.
(2010), ignoring hypotheses outside the version space is brittle business, because a mistake
in computation of the version space dooms the algorithm to fail. They propose a scheme in
which no version space is computed. Instead, a certain importance weighted scheme is used.
We also use importance weighting, but in the pool based setting and not in the streaming
setting as they do.²

Analyzing the difference between losses of hypotheses ("relative regrets") is used in almost
all theoretical work on active learning, but is not attacked directly. In this work we argue
that by carefully constructing empirical processes uniformly estimating the relative regret 



1. The $\tilde{O}$ notation suppresses polylogarithmic terms.

2. Note that a practitioner can pretend that any pool based input is a stream, though that approach would 
probably not take full advantage of the data. 






Active Learning Using SRRAs 



of all hypotheses with respect to a fixed "pivotal" hypothesis yields fast active learning 
rates. We call such constructions SRRA (Smooth Relative Regret Approximations). 

We also show that low disagreement coefficient and VC dimension assumptions imply 
such efficient constructions, and give rise to yet another proof for the usefulness of the 
disagreement coefficient in active learning. The resulting algorithm does not need to
compute or restrict itself to shrinking version spaces. We then present two fundamental
pool based learning problems for which direct SRRA construction yields superior active 
learning rates, whereas any known argument that uses the disagreement coefficient only, 
requires the practitioner to obtain labels for the entire pool (!) even for moderately chosen 
parameters. We conclude that the SRRA method is, up to minor factors, at least as good 
as the disagreement coefficient method, but can be significantly better in certain cases. 

We note that another important line of design and analysis of active learning algorithms
makes certain structural or Bayesian assumptions on the noise (e.g., Balcan et al., 2007;
Castro and Nowak, 2008; Hanneke, 2009; Koltchinskii, 2010; Yang et al., 2010; Wang, 2011;
Yang et al., 2011; Minsker, 2012). We expect that one can get yet improved analysis in our
framework under these assumptions. We leave this to future work.

The rest of the paper is laid out as follows: In Section [2] we present notations and basic 
definitions, including an introduction to our method. In Section [3] we show that existence 
of low disagreement coefficient implies our method, in some sense. Then, we present our 
two main applications, learning to rank from pairwise preferences (LRPP) in Section [4] and
clustering with side information in Section [5]. In Section [6] we present additional results and
practical considerations, and in particular how to use our methods with convex relaxations 
if the ERM problems that arise in the discussion are too difficult (computationally) to 
optimally solve. We conclude in Section [7] and suggest future directions. 



2. Definitions and Notation 



We follow the notation of Hanneke (2011): Let $\mathcal{X}$ be an instance space, and let $\mathcal{Y} = \{0, 1\}$
be a label space. Denote by $\mathcal{D}$ the distribution over $\mathcal{X} \times \mathcal{Y}$, with corresponding marginals
$\mathcal{D}_{\mathcal{X}}$ and $\mathcal{D}_{\mathcal{Y}}$. In this work we assume for convenience that each label $Y$ is a deterministic
function of $X$, so that if $X \sim \mathcal{D}_{\mathcal{X}}$ then $(X, Y(X))$ is distributed according to $\mathcal{D}$.

By $\mathcal{C}$ we denote a concept class of functions mapping $\mathcal{X}$ to $\mathcal{Y}$. The error rate of a
hypothesis $h \in \mathcal{C}$ equals

$$\mathrm{er}_{\mathcal{D}}(h) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}[h(X) \neq Y(X)] .$$

The noise rate $\nu$ of $\mathcal{C}$ is defined as $\nu = \inf_{h\in\mathcal{C}} \mathrm{er}_{\mathcal{D}}(h)$. We will focus on the scenario in
which $\nu$ is attained at an optimal hypothesis $h^*$, so that $\mathrm{er}_{\mathcal{D}}(h^*) = \nu$. Define the distance
$\mathrm{dist}(h_1, h_2)$ between two hypotheses $h_1, h_2 \in \mathcal{C}$ as $\Pr_{X\sim\mathcal{D}_{\mathcal{X}}}[h_1(X) \neq h_2(X)]$; observe that
$\mathrm{dist}(\cdot,\cdot)$ is a pseudo-metric over pairs of hypotheses. For a hypothesis $h \in \mathcal{C}$ and a number
$r > 0$, the ball $B(h, r)$ around $h$ of radius $r$ is defined as $\{h' \in \mathcal{C} : \mathrm{dist}(h, h') \le r\}$. For a
set $V \subseteq \mathcal{C}$ of hypotheses, let $\mathrm{DIS}(V)$ denote

$$\mathrm{DIS}(V) = \{x \in \mathcal{X} : \exists h_1, h_2 \in V \text{ s.t. } h_1(x) \neq h_2(x)\} .$$

2.1 The Disagreement Coefficient

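The disagreement coefficient of $h$ with respect to $\mathcal{C}$ under $\mathcal{D}_{\mathcal{X}}$ is defined as

$$\theta_h = \sup_{r>0} \frac{\Pr_{\mathcal{D}_{\mathcal{X}}}[\mathrm{DIS}(B(h,r))]}{r} , \tag{1}$$

where $\Pr_{\mathcal{D}_{\mathcal{X}}}[W]$ for $W \subseteq \mathcal{X}$ denotes the probability measure with respect to the distribution
$\mathcal{D}_{\mathcal{X}}$. Define the uniform disagreement coefficient $\theta$ as $\sup_{h\in\mathcal{C}} \theta_h$, namely

$$\theta = \sup_{h\in\mathcal{C}} \sup_{r>0} \frac{\Pr_{\mathcal{D}_{\mathcal{X}}}[\mathrm{DIS}(B(h,r))]}{r} . \tag{2}$$

Remark 1 A useful slight variation of the definitions of $\theta_h$ and $\theta$ can be obtained by
replacing $\sup_{r>0}$ with $\sup_{r\ge\nu}$ in (1) and (2). We will explicitly say when we refer to this
variation in what follows.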
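
On a finite pool with the uniform measure, the definitions of dist, $B(h,r)$, DIS and $\theta_h$ can be evaluated by brute force. The following Python sketch is only an illustration: the threshold concept class, the pool of ten points and every function name are assumptions made for the example and are not taken from the paper.

```python
from fractions import Fraction

def threshold(t):
    """Hypothetical concept: predicts 1 exactly when x >= t."""
    return lambda x: int(x >= t)

def dist(h1, h2, pool):
    """Pr_X[h1(X) != h2(X)] under the uniform measure on the pool."""
    return Fraction(sum(h1(x) != h2(x) for x in pool), len(pool))

def ball(h, r, concept_class, pool):
    """B(h, r): concepts within distance r of h."""
    return [g for g in concept_class if dist(h, g, pool) <= r]

def dis(concepts, pool):
    """DIS(V): pool points on which the concepts in V are not unanimous."""
    return [x for x in pool if len({g(x) for g in concepts}) > 1]

def disagreement_coefficient(h, concept_class, pool):
    """theta_h = sup_{r>0} Pr[DIS(B(h, r))] / r.

    On a finite pool Pr[DIS(B(h, r))] only changes at radii equal to a
    realised distance, so scanning those radii attains the supremum.
    """
    radii = sorted({dist(h, g, pool) for g in concept_class} - {0})
    return max(
        Fraction(len(dis(ball(h, r, concept_class, pool), pool)), len(pool)) / r
        for r in radii
    )

pool = list(range(10))
concept_class = [threshold(t) for t in range(11)]
theta_mid = disagreement_coefficient(concept_class[5], concept_class, pool)
theta_edge = disagreement_coefficient(concept_class[0], concept_class, pool)
theta_uniform = max(
    disagreement_coefficient(h, concept_class, pool) for h in concept_class
)
```

Exact rational arithmetic (`Fraction`) is used so that the comparison `dist <= r` is not affected by floating point rounding.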

2.2 Smooth Relative Regret Approximations (SRRA) 

Fix $h \in \mathcal{C}$ (which we call the pivotal hypothesis). Denote by $\mathrm{reg}_h : \mathcal{C} \to \mathbb{R}$ the function
defined as

$$\mathrm{reg}_h(h') = \mathrm{er}_{\mathcal{D}}(h') - \mathrm{er}_{\mathcal{D}}(h) .$$

We call $\mathrm{reg}_h$ the relative regret function with respect to $h$. Note that for $h = h^*$ this is
simply the usual regret, or (in-class) excess risk function.

Definition 2 Let $f : \mathcal{C} \to \mathbb{R}$ be any function, and $0 < \varepsilon < 1/5$ and $0 \le \mu \le 1$. We say that
$f$ is an $(\varepsilon, \mu)$-smooth relative regret approximation ($(\varepsilon,\mu)$-SRRA) with respect to $h$ if for
all $h' \in \mathcal{C}$,

$$|f(h') - \mathrm{reg}_h(h')| \le \varepsilon \cdot (\mathrm{dist}(h,h') + \mu) .$$

If $\mu = 0$ we simply call $f$ an $\varepsilon$-smooth relative regret approximation with respect to $h$.

Although the definition is general, the applications we study in detail fall under the
category of pool based active learning, in which $\mathcal{X}$ is a finite set of size $N$ and $\mathcal{D}_{\mathcal{X}}$ is the
uniform measure. Hence, taking $\mu = O(1/N)$ is tantamount here to $\mu = 0$. This will be
useful in what follows. The following theorem and corollary constitute the main ingredient
in our work.

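Theorem 3 Let $h \in \mathcal{C}$ and $f$ be an $(\varepsilon, \mu)$-SRRA with respect to $h$. Let $h_1 = \arg\min_{h'\in\mathcal{C}} f(h')$.
Then

$$\mathrm{er}_{\mathcal{D}}(h_1) = (1 + O(\varepsilon))\,\nu + O(\varepsilon \cdot \mathrm{er}_{\mathcal{D}}(h)) + O(\varepsilon\mu) .$$

Proof Let $h^* = \arg\min_{h'\in\mathcal{C}} \mathrm{reg}_h(h')$. Applying the definition of $(\varepsilon,\mu)$-SRRA we have:

$$\begin{aligned}
\mathrm{er}_{\mathcal{D}}(h_1) &\le \mathrm{er}_{\mathcal{D}}(h) + f(h_1) + \varepsilon\,\mathrm{dist}(h, h_1) + \varepsilon\mu \\
&\le \mathrm{er}_{\mathcal{D}}(h) + f(h^*) + \varepsilon\,\mathrm{dist}(h, h_1) + \varepsilon\mu \\
&\le \mathrm{er}_{\mathcal{D}}(h) + \nu - \mathrm{er}_{\mathcal{D}}(h) + \varepsilon\,\mathrm{dist}(h, h^*) + \varepsilon\,\mathrm{dist}(h, h_1) + 2\varepsilon\mu \\
&\le \nu + \varepsilon\,(2\,\mathrm{dist}(h, h^*) + \mathrm{dist}(h_1, h^*)) + 2\varepsilon\mu .
\end{aligned} \tag{3}$$

The first inequality is from the definition of $(\varepsilon,\mu)$-SRRA, the second is from the fact that $h_1$
minimizes $f(\cdot)$ by construction, the third is again from the definition of $(\varepsilon,\mu)$-SRRA and from
the definition of the relative regret function $\mathrm{reg}_h$, the fourth is by the triangle inequality.
Now, clearly for any two hypotheses $g, g' \in \mathcal{C}$ we have that $\mathrm{dist}(g,g') \le \mathrm{er}_{\mathcal{D}}(g) + \mathrm{er}_{\mathcal{D}}(g')$
by the triangle inequality. The proof is completed by plugging $\mathrm{dist}(h,h^*) \le \mathrm{er}_{\mathcal{D}}(h) + \nu$
and $\mathrm{dist}(h_1, h^*) \le \mathrm{er}_{\mathcal{D}}(h_1) + \nu$ into Equation (3), subtracting $\varepsilon \cdot \mathrm{er}_{\mathcal{D}}(h_1)$ from both sides, and
dividing by $(1 - \varepsilon)$. ■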



A simple inductive use of Theorem [3] proves the following corollary, bounding the query 
complexity of an ERM based active learning algorithm (see Algorithm [1] for corresponding 
pseudocode). Note that this algorithm never restricts itself to a shrinking version space. 

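Corollary 4 Let $h_0, h_1, h_2, \dots$ be a sequence of hypotheses in $\mathcal{C}$ such that for all $i \ge 1$,
$h_i = \arg\min_{h'\in\mathcal{C}} f_{i-1}(h')$, where $f_{i-1}$ is an $(\varepsilon, \mu)$-SRRA with respect to $h_{i-1}$. Then for all
$i \ge 0$,

$$\mathrm{er}_{\mathcal{D}}(h_i) = (1 + O(\varepsilon))\,\nu + O(\varepsilon^i)\,\mathrm{er}_{\mathcal{D}}(h_0) + O(\varepsilon\mu) .$$

Proof Applying Theorem [3] with $h_i$ and $h_{i-1}$, we have

$$\mathrm{er}_{\mathcal{D}}(h_i) = (1 + O(\varepsilon))\,\nu + O(\varepsilon \cdot \mathrm{er}_{\mathcal{D}}(h_{i-1})) + O(\varepsilon\mu) .$$

Solving this recursion, one gets

$$\mathrm{er}_{\mathcal{D}}(h_i) = (1 + O(\varepsilon))\,\nu \sum_{j=0}^{i-1} O(\varepsilon)^j + O(\varepsilon^i) \cdot \mathrm{er}_{\mathcal{D}}(h_0) + O\Big(\mu \sum_{j=1}^{i} O(\varepsilon)^j\Big) .$$

The result follows easily by bounding geometric sums. ■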



Algorithm 1 An Active Learning Algorithm from SRRA's

Input: an initial solution $h_0 \in \mathcal{C}$, estimation parameters $\varepsilon \in (0, 1/5)$, $\mu \ge 0$, and number
of iterations $T$
  $i \leftarrow 0$
  repeat
    $h_{i+1} \leftarrow \arg\min_{h'\in\mathcal{C}} f(h')$, where $f$ is an $(\varepsilon, \mu)$-smooth relative regret approximation
      with respect to $h_i$
    $i \leftarrow i + 1$
  until $i$ equals $T$
  return $h_T$
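
The loop of Algorithm 1 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the threshold class, the noisy labels and all names are hypothetical, and `make_srra` is a stand-in oracle that reads every label and then perturbs the true relative regret by at most $\varepsilon \cdot \mathrm{dist}$ (so it is an $\varepsilon$-SRRA with $\mu = 0$ by construction, but not a query efficient one).

```python
from fractions import Fraction

POOL = list(range(20))
# Hypothetical labels: a threshold at 8 with two flipped points (3 and 15).
LABELS = {x: int(x >= 8) for x in POOL}
LABELS[3], LABELS[15] = 1, 0
CONCEPTS = list(range(21))        # concept t predicts 1 exactly when x >= t

def predict(t, x):
    return int(x >= t)

def err(t):
    """er_D(t) under the uniform measure on POOL."""
    return Fraction(sum(predict(t, x) != LABELS[x] for x in POOL), len(POOL))

def dist(s, t):
    return Fraction(sum(predict(s, x) != predict(t, x) for x in POOL), len(POOL))

def make_srra(pivot, eps):
    """Stand-in SRRA: |f(t) - reg_pivot(t)| <= eps * dist(pivot, t)."""
    def f(t):
        sign = 1 if t % 2 == 0 else -1      # arbitrary bounded perturbation
        return err(t) - err(pivot) + sign * eps * dist(pivot, t)
    return f

def srra_active_learning(h0, eps, num_iters):
    """Algorithm 1: repeatedly move the pivot to the minimizer of the SRRA."""
    trajectory = [h0]
    for _ in range(num_iters):
        f = make_srra(trajectory[-1], eps)
        trajectory.append(min(CONCEPTS, key=f))
    return trajectory

trajectory = srra_active_learning(h0=20, eps=Fraction(1, 10), num_iters=4)
```

Note that the minimization always ranges over the whole class `CONCEPTS`; no version space is maintained between iterations.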



We will show below problems of interest in which $(\varepsilon, \mu)$-SRRA's with respect to a given
hypothesis $h$ can be obtained using queries $Y(X)$ at few randomly (and adaptively) selected
points $X$ from the pool $\mathcal{X}$, if the uniform disagreement coefficient $\theta$ is small. This will
constitute another proof for the usefulness of the disagreement coefficient in design and
analysis of active learning algorithms. We then present two problems for which a direct
construction of an SRRA yields a significantly better query complexity than that guaranteed
using the disagreement coefficient alone.

3. Constant Uniform Disagreement Coefficient Implies Efficient SRRA's

We show that a bounded uniform disagreement coefficient implies existence of query efficient
$(\varepsilon, \mu)$-SRRA's. This constitutes yet another proof of the usefulness of the disagreement
coefficient in design of active learning algorithms, via Algorithm [1].

3.1 The Construction 

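Returning to our problem, assume the uniform disagreement coefficient $\theta$ corresponding to
$\mathcal{C}$ is finite and $\nu > 0$. Fix some failure probability $\delta$. We consider the range space $(\mathcal{X}, \mathcal{C}^*)$,
defined by

$$\mathcal{C}^* = \Big(\bigcup_{h'\in\mathcal{C}} \big\{\{X \in \mathcal{X} : h'(X) = 0\}\big\}\Big) \cup \Big(\bigcup_{h'\in\mathcal{C}} \big\{\{X \in \mathcal{X} : h'(X) = 1\}\big\}\Big) .$$

In other words, $\mathcal{C}^*$ is the collection of all subsets $S \subseteq \mathcal{X}$ whose elements $X \in S$ are exactly
those mapped to the same value (0 or 1) by $h'$, for some $h' \in \mathcal{C}$. Assume $(\mathcal{X}, \mathcal{C}^*)$ has VC
dimension $d$, and fix $h \in \mathcal{C}$. Let $L = \lceil \log(1/\mu) \rceil$. Define $\mathcal{X}_0 = \mathrm{DIS}(B(h, \mu))$ and for
$i = 1, 2, \dots, L$, define $\mathcal{X}_i$ to be

$$\mathcal{X}_i = \mathrm{DIS}(B(h, \mu 2^{i})) \setminus \mathrm{DIS}(B(h, \mu 2^{i-1})) .$$

Let $\eta_i = \Pr_{\mathcal{D}_{\mathcal{X}}}[\mathcal{X}_i]$ be the measure of $\mathcal{X}_i$, with $\delta$ the failure probability fixed above. For each
$i \ge 0$ draw a sample $X_{i,1}, \dots, X_{i,m}$ of $m = O\left(\varepsilon^{-2}\theta\left(d\log\theta + \log(\delta^{-1}\log(1/\mu))\right)\right)$ examples in $\mathcal{X}_i$,
each of which is drawn independently from the distribution $\mathcal{D}_{\mathcal{X}}|\mathcal{X}_i$ (with repetitions). (By $\mathcal{D}_{\mathcal{X}}|\mathcal{X}_i$
we mean the distribution $\mathcal{D}_{\mathcal{X}}$ conditioned on $\mathcal{X}_i$.) We will now define an estimator function
$f : \mathcal{C} \to \mathbb{R}$ of $\mathrm{reg}_h$, as follows. For any $h' \in \mathcal{C}$ and $i = 0, 1, \dots, L$ let

$$f_i(h') = \eta_i \cdot m^{-1} \sum_{j=1}^{m} \left(\mathbf{1}_{Y(X_{i,j}) \neq h'(X_{i,j})} - \mathbf{1}_{Y(X_{i,j}) \neq h(X_{i,j})}\right) .$$

Our estimator is now defined as $f(h') = \sum_{i=0}^{L} f_i(h')$. We next show:

Theorem 5 Let $f$, $h$, $h'$, $m$ be as above. With probability at least $1 - \delta$, $f$ is an $(\varepsilon, \mu)$-SRRA
with respect to $h$.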
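
The construction can be sketched for a finite pool as follows. This is a hedged illustration only: it assumes base-2 logarithms in $L = \lceil\log(1/\mu)\rceil$, it assumes that each per-shell estimate averages the difference of the two 0/1 losses (the form consistent with the decomposition used in the proof), and the threshold class and all function names are hypothetical. The sample size `m` is passed in directly rather than derived from $\theta$, $d$ and $\delta$.

```python
import math
import random
from fractions import Fraction

def threshold_predict(t, x):
    """Hypothetical concept t: predicts 1 exactly when x >= t."""
    return int(x >= t)

def build_shells(h, concept_class, pool, mu, predict):
    """X_0 = DIS(B(h, mu)); X_i = DIS(B(h, mu 2^i)) minus DIS(B(h, mu 2^(i-1)))."""
    num_levels = math.ceil(math.log2(float(1 / mu)))    # L, base 2 assumed

    def dist(g1, g2):
        return Fraction(sum(predict(g1, x) != predict(g2, x) for x in pool), len(pool))

    def dis_of_ball(radius):
        members = [g for g in concept_class if dist(h, g) <= radius]
        return {x for x in pool if len({predict(g, x) for g in members}) > 1}

    shells, covered = [], set()
    for i in range(num_levels + 1):
        region = dis_of_ball(mu * 2 ** i)
        shells.append(sorted(region - covered))
        covered = region
    return shells

def build_srra(h, shells, pool_size, m, query_label, predict, rng):
    """Query m labels per non-empty shell and return the estimator f."""
    labelled = []
    for shell in shells:
        eta = Fraction(len(shell), pool_size)            # measure of the shell
        draws = [rng.choice(shell) for _ in range(m)] if shell else []
        labelled.append((eta, [(x, query_label(x)) for x in draws]))

    def f(h_prime):
        total = Fraction(0)
        for eta, sample in labelled:
            diff = sum((predict(h_prime, x) != y) - (predict(h, x) != y)
                       for x, y in sample)
            total += eta * Fraction(diff, m)
        return total

    return f

pool = list(range(16))
concept_class = list(range(17))
queries = []

def query_label(x):
    """Hypothetical label oracle (threshold at 6) that records each query."""
    queries.append(x)
    return int(x >= 6)

shells = build_shells(8, concept_class, pool, Fraction(1, 16), threshold_predict)
f = build_srra(8, shells, len(pool), 5, query_label, threshold_predict, random.Random(0))
```

Points in the outer shells have larger measure $\eta_i$ but are sampled no more often than points in the inner shells, which is where the importance weighting enters.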

Proof A main tool to be exploited in the proof is called relative $\varepsilon$-approximations, due to
Haussler (1992) and Li et al. (2000). It is defined as follows. Let $h : \mathcal{X} \to \mathbb{R}^+$ be some
function, and let $\mu_h = \mathbb{E}_{X\sim\mathcal{D}_{\mathcal{X}}}[h(X)]$. Let $X_1, \dots, X_m$ denote i.i.d. draws from $\mathcal{D}_{\mathcal{X}}$, and let
$\hat\mu_h = \frac{1}{m}\sum_{i=1}^{m} h(X_i)$ denote the empirical average. Let $\kappa > 0$ be an adjustable parameter.
We are going to use the following measure of distance between $\mu_h$ and its estimator $\hat\mu_h$, to
determine how far the latter is from the true expectation:

$$d_\kappa(\mu_h, \hat\mu_h) = \frac{|\mu_h - \hat\mu_h|}{\mu_h + \hat\mu_h + \kappa} .$$

This measure corresponds to a relative error when approximating $\mu_h$ by $\hat\mu_h$. Indeed, let
$\varepsilon > 0$ be our approximation ratio, and put $d_\kappa(\mu_h, \hat\mu_h) < \varepsilon$. This easily yields

$$|\mu_h - \hat\mu_h| < \frac{2\varepsilon}{1-\varepsilon}\,\mu_h + \frac{\varepsilon}{1-\varepsilon}\,\kappa . \tag{4}$$

In other words, this implies that $|\mu_h - \hat\mu_h| \le O(\varepsilon)(\mu_h + \kappa)$.

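Let us fix a parameter $0 < \delta < 1$. Assume $\mathcal{C}$ is a set of $\{0, 1\}$ valued functions on $\mathcal{X}$ of
VC dimension $d$. Li et al. (2000) show that if one samples

$$m = O\left(\varepsilon^{-2}\kappa^{-1}\left(d\log\kappa^{-1} + \log\delta^{-1}\right)\right)$$

examples as above, then (4) holds uniformly for all $h \in \mathcal{C}$ with probability at least $1 - \delta$.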

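We now apply this definition of relative $\varepsilon$-approximations, and the corresponding results,
within our context. For any $h'$, we define the following four sets of instances:

$$\begin{aligned}
R^{++}_{h'} &= \{X \in \mathcal{X} : h'(X) = Y(X) = 1 \text{ and } h(X) = 0\} \\
R^{+-}_{h'} &= \{X \in \mathcal{X} : h'(X) = 1 \text{ and } h(X) = Y(X) = 0\} \\
R^{-+}_{h'} &= \{X \in \mathcal{X} : h'(X) = 0 \text{ and } h(X) = Y(X) = 1\} \\
R^{--}_{h'} &= \{X \in \mathcal{X} : h'(X) = Y(X) = 0 \text{ and } h(X) = 1\}
\end{aligned}$$

Observe that the set $\{X \in \mathcal{X} : h(X) \neq h'(X)\}$ equals the disjoint union of $R^{++}_{h'}$, $R^{+-}_{h'}$,
$R^{-+}_{h'}$ and $R^{--}_{h'}$. For each $i = 0, \dots, L$ and $b \in \{++, +-, -+, --\}$ let $R^{b}_{h',i} = R^{b}_{h'} \cap \mathcal{X}_i$.
Let $\mathcal{R}^{b}_i = \{R^{b}_{h',i} : h' \in \mathcal{C}\}$. It is easy to verify that the VC dimension of the range spaces
$(\mathcal{X}_i, \mathcal{R}^{b}_i)$ is at most $d$. Each set in $\mathcal{R}^{b}_i$ is an intersection of a set in $\mathcal{C}^*$ with some fixed set.

For any $R \subseteq \mathcal{X}_i$ let $p_i(R) = \Pr_{X\sim\mathcal{D}_{\mathcal{X}}|\mathcal{X}_i}[X \in R]$, and $\hat p_i(R) = m^{-1}\sum_{j=1}^{m} \mathbf{1}_{X_{i,j}\in R}$. Note
that $\hat p_i(R)$ is an unbiased estimator of $p_i(R)$.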

By the choice of $m$, inequality (4), and the assumptions on $\theta$ and $\nu$ we have that with
probability at least $1 - \delta/L$, for all $R \in \mathcal{R}^{++}_i \cup \mathcal{R}^{+-}_i \cup \mathcal{R}^{-+}_i \cup \mathcal{R}^{--}_i$,

$$|\hat p_i(R) - p_i(R)| = O(\varepsilon) \cdot (p_i(R) + \theta^{-1}) , \tag{5}$$

and by the probability union bound we obtain that this uniformly holds for all $i = 0, \dots, L$
with probability at least $1 - \delta$.

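Now fix $h' \in \mathcal{C}$ and let $r = \mathrm{dist}(h, h')$. Let $i_r = \lceil\log(r/\mu)\rceil$. By the definition of $\mathcal{X}_i$,
$h(X) = h'(X)$ for all $X \in \mathcal{X}_i$ whenever $i > i_r$. We can therefore decompose $\mathrm{reg}_h(h')$ as:

$$\begin{aligned}
\mathrm{reg}_h(h') &= \mathrm{er}_{\mathcal{D}}(h') - \mathrm{er}_{\mathcal{D}}(h) \\
&= \sum_{i=0}^{i_r} \eta_i \cdot \Big(\Pr_{X\sim\mathcal{D}_{\mathcal{X}}|\mathcal{X}_i}[Y(X) \neq h'(X)] - \Pr_{X\sim\mathcal{D}_{\mathcal{X}}|\mathcal{X}_i}[Y(X) \neq h(X)]\Big) \\
&= \sum_{i=0}^{i_r} \eta_i \cdot \Big(-p_i(R^{++}_{h',i}) + p_i(R^{+-}_{h',i}) + p_i(R^{-+}_{h',i}) - p_i(R^{--}_{h',i})\Big) .
\end{aligned}$$

On the other hand, we similarly have that

$$f(h') = \sum_{i=0}^{i_r} \eta_i \cdot \Big(-\hat p_i(R^{++}_{h',i}) + \hat p_i(R^{+-}_{h',i}) + \hat p_i(R^{-+}_{h',i}) - \hat p_i(R^{--}_{h',i})\Big) .$$

Combining, we conclude using (5) that

$$|\mathrm{reg}_h(h') - f(h')| \le O\Big(\varepsilon \sum_{i=0}^{i_r} \eta_i \cdot \big(p_i(R^{++}_{h',i}) + p_i(R^{+-}_{h',i}) + p_i(R^{-+}_{h',i}) + p_i(R^{--}_{h',i}) + 4\theta^{-1}\big)\Big) . \tag{6}$$

But now notice that $\sum_{i=0}^{i_r} \eta_i \cdot \big(p_i(R^{++}_{h',i}) + p_i(R^{+-}_{h',i}) + p_i(R^{-+}_{h',i}) + p_i(R^{--}_{h',i})\big)$ equals $r$, since it
corresponds to those elements $X \in \mathcal{X}$ on which $h, h'$ disagree. Also note that $\sum_{i=0}^{i_r} \eta_i$ is at
most $2\max\{\Pr_{\mathcal{D}_{\mathcal{X}}}[\mathrm{DIS}(B(h, r))], \Pr_{\mathcal{D}_{\mathcal{X}}}[\mathrm{DIS}(B(h, \mu))]\}$. By the definition of $\theta$, this implies
that the RHS of (6) is bounded by $\varepsilon(r + \mu)$, as required by the definition of $(\varepsilon,\mu)$-SRRA.³ ■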



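Corollary 6 An $(\varepsilon, \mu)$-SRRA with respect to $h$ can be constructed, with probability at least
$1 - \delta$, using at most

$$m\,(1 + \lceil\log(1/\mu)\rceil) = O\left(\theta\varepsilon^{-2}\,(\log(1/\mu))\,\left(d\log\theta + \log(\delta^{-1}\log(1/\mu))\right)\right) \tag{7}$$

label queries.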

Combining Corollaries [4] and [6] (Algorithm [1]), we obtain an active learning algorithm
in the ERM setting, with query complexity depending on the uniform disagreement coefficient
and the VC dimension. Assume $\delta$ is a constant. If we are interested in excess
risk of order at least that of the optimal error $\nu$, then we may take $\varepsilon$ to be, say, $1/5$ and
achieve the sought bound by constructing $(1/5, \nu)$-SRRA's using $O(\theta d(\log(1/\nu))(\log\theta))$ query labels,
once for each of $O(\log(1/\nu))$ iterations of Algorithm [1]. If we seek a solution with error
$(1 + \varepsilon)\nu$, we would need to construct $(\varepsilon, \nu)$-SRRA's using $O(\theta d\varepsilon^{-2}(\log(1/\nu))(\log\theta))$ query
labels, one for each of $O(\log(1/\nu))$ iterations of the algorithm. The total label query complexity
is $O(\theta d(\log^2(1/\nu))(\log\theta))$, which is $O(\log(1/\nu))$ times the best known bounds using
disagreement coefficient and VC dimension bounds only.

A few more comparison notes are in order. First, note that in known arguments bounding
query complexity using the disagreement coefficient, the disagreement coefficient $\theta_{h^*}$
with respect to the optimal hypothesis $h^*$ is used in the analysis, and not the uniform
coefficient $\theta$. Also note that both in previously known results bounding query complexity
using disagreement coefficient and VC dimension bounds as well as in our result, the slight
improvement described in Remark [1] applies. In other words, all arguments remain valid if
we replace the supremums in ([1]) and ([2]) with $\sup_{r\ge\nu}$.

4. Application #1: Learning to Rank from Pairwise Preferences (LRPP)

"Learning to Rank" takes various forms in theory and practice of learning, as well as in 
combinatorial optimization. In all versions, the goal is to order a set V based on constraints. 

3. The O-notation disappeared because we assume that the constants are properly chosen in the definition
of the sample size $m$.

A large body of learning literature considers the following scenario: For each $v \in V$
there is a label on some discrete ordinal scale, and the goal is to learn how to order $V$
so as to respect induced pairwise preferences. For example, consider a scale of {1,2,3,4,5}, as in
hotel/restaurant star quality: if $u$ has a label of 5 ("very good") and $v$ has a label
of 1 ("very bad"), then any ordering that places $v$ ahead of $u$ is penalized. Note that even
if the labels are noisy, the induced pairwise preferences here are always transitive, hence no
combinatorial problem arises. Our work does not deal with this setting.

When the basic unit of information consists of preferences over pairs $u, v$, then the
problem becomes combinatorially interesting. In case all quadratically many pairwise preferences
are given for free, the corresponding optimization problem is known as Minimum
Feedback Arc-Set in Tournaments (MFAST).⁴ MFAST is NP-hard (Alon, 2006). Recently,
Kenyon-Mathieu and Schudy (2007) showed a PTAS for this (passive learning) problem. Several
important recent works address the challenge of approximating the minimum feedback
arc-set problem (Ailon et al., 2008; Braverman and Mossel, 2008; Coppersmith et al., 2010).

Here we consider a query efficient variant of the problem, in which each preference comes
with a cost, and the goal is to produce a competitive solution while reducing the preference-query
overhead. Other very recent works consider similar settings (Jamieson and Nowak,
2011; Ailon, 2012). Jamieson and Nowak (2011) consider a common scenario in which the
alternatives can be characterized in terms of $d$ real-valued features and the ranking obeys
the structure of the Euclidean distances between such embeddings. They present an active
learning algorithm that requires, using average case analysis, as few as $O(d\log n)$ labels in
the noiseless case, and $O(d\log^2 n)$ labels under a certain parametric noise model. Our work
uses worst-case analysis, and assumes an adversarial noise model. In this Section we analyze
the pure combinatorial problem (not assuming any feature embeddings). In Section [6] we
tackle the problem with linearly induced permutation over feature space embeddings.

Ailon (2012) considers the same setting as ours. Our main result, Corollary [10], is a slight
improvement over the main result of Ailon (2012) in query complexity, but it provides another
significant improvement. Ailon (2012) uses a querying method that is based on a
divide and conquer strategy. The weakness of such a strategy can be explained by considering
an example in which we want to search a restricted set of permutations (e.g., the
setting of Section [6.1]). When dividing and conquering, the algorithm in Ailon (2012) is
doomed to search a Cartesian product of two permutation spaces (left and right). There is
no guarantee that there even exists a permutation in the restricted space that respects this
division. In our querying algorithm this limitation is lifted.



4.1 Problem Definition 

Let $V$ be a set of $n$ elements (alternatives). The instance space $\mathcal{X}$ is taken to be the set of
all distinct pairs of elements in $V$, namely $V \times V \setminus \{(u,u) : u \in V\}$.⁵ The distribution $\mathcal{D}_{\mathcal{X}}$ is
uniform on $\mathcal{X}$. The label function $Y : \mathcal{X} \to \{0, 1\}$ encodes a preference function satisfying
$Y((u,v)) = 1 - Y((v,u))$ for all $u, v \in V$. By convention, we think of $Y((u,v)) = 1$ as a
stipulation that $u$ is preferred over $v$. For convenience we will drop the double-parentheses
in what follows.



4. A maximization version exists as well. 

5. Note that we could have defined $\mathcal{X}$ to be unordered pairs of elements in $V$ without making any
assumption on $Y$. We chose this definition for convenience in what follows.

The class of solution functions $\mathcal{C}$ we consider is all $h : \mathcal{X} \to \{0, 1\}$ such that it is
skew-symmetric: $h(u,v) = 1 - h(v,u)$, and transitive: $h(u,z) \le h(u,v) + h(v,z)$ for all
distinct $u, v, z \in V$. This is equivalent to the space of permutations over $V$, and we will
use the notation $\pi, \sigma, \dots$ instead of $h, h', \dots$ in the remainder of the section. We also use
notation $u \prec_\pi v$ as a predicate equivalent to $\pi(u,v) = 1$. Endowing $\mathcal{X}$ with the uniform
measure, $\mathrm{dist}(\pi, \sigma)$ turns out to be (up to normalization) the well known Kendall-$\tau$ distance:
$\mathrm{dist}(\pi,\sigma) = N^{-1}\sum_{u\neq v} \mathbf{1}_{\pi(u,v)\neq\sigma(u,v)}$, where $N = n(n-1)$ is the number of all ordered pairs.

4.2 The Weakness of Using Disagreement Coefficient Arguments 

Let us first see if we can get a useful active learning algorithm using disagreement coefficient
arguments. It has been shown in Ailon (2012) that the uniform disagreement coefficient
of $\mathcal{C}$ is $\Omega(n)$. To see this simple fact, notice that if we start from some permutation $\pi$
and swap the positions of any two elements $u, v \in V$, then we obtain a permutation of
distance at most $O(1/n)$ away from $\pi$, hence the disagreement region of the ball of radius
$O(1/n)$ around $\pi$ is the entire space $\mathcal{X}$. It is also known that the VC dimension of $\mathcal{C}$ is
$n - 1$ (see Radinsky and Ailon, 2011). Using Corollary [6] we conclude that we would need
$\Omega(n^2)$ preference labels to obtain an $(\varepsilon, \mu)$-SRRA for any meaningful pair $(\varepsilon, \mu)$. This is
uninformative because the cardinality of $\mathcal{X}$ is $O(n^2)$. A similar bound is obtained using any
known active learning bound using disagreement coefficient and VC-dimension bounds only.
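
The swap argument can be checked numerically on a small instance. The Python sketch below is an illustration under assumptions (the helper names and the choice $n = 8$ are hypothetical): it computes the normalized Kendall-$\tau$ distance over ordered pairs, forms every permutation obtained from $\pi$ by one swap, and collects the pairs on which some swap disagrees with $\pi$.

```python
from fractions import Fraction
from itertools import combinations, permutations

def positions(order):
    """Map each element to its 1-based position in the ranking."""
    return {u: rank for rank, u in enumerate(order, start=1)}

def differing_pairs(order_pi, order_sigma):
    """Ordered pairs (u, v) on which the two rankings give different verdicts."""
    pos_pi, pos_sigma = positions(order_pi), positions(order_sigma)
    return {
        (u, v)
        for u, v in permutations(order_pi, 2)
        if (pos_pi[u] < pos_pi[v]) != (pos_sigma[u] < pos_sigma[v])
    }

def kendall_dist(order_pi, order_sigma):
    """dist(pi, sigma) = N^{-1} * #differing ordered pairs, with N = n(n-1)."""
    n = len(order_pi)
    return Fraction(len(differing_pairs(order_pi, order_sigma)), n * (n - 1))

def swap(order, u, v):
    """Return a copy of `order` with the positions of u and v exchanged."""
    swapped = list(order)
    i, j = swapped.index(u), swapped.index(v)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

n = 8
pi = list(range(n))
neighbours = [swap(pi, u, v) for u, v in combinations(pi, 2)]
# Largest distance from pi to a single-swap permutation.
radius = max(kendall_dist(pi, sigma) for sigma in neighbours)
# Pairs on which at least one single-swap permutation disagrees with pi.
disagreement_region = set().union(*(differing_pairs(pi, s) for s in neighbours))
```

Since the ball of radius `radius` already has the whole of $\mathcal{X}$ as its disagreement region, the ratio in the definition of $\theta_\pi$ is at least `1 / radius`, which grows linearly in $n$.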



Remark 7 A slight improvement can be obtained using the refined definition of disagreement
coefficients of Remark [1]. Observe that the uniform disagreement coefficient, as well
as the disagreement coefficient at the optimal solution $h^*$, becomes $\theta = \theta_{h^*} = O(1/\nu)$, if […].
This improves the query complexity bound to $O(n\nu^{-1})$. If $\nu$ tends to $n^{-1}$ from
above, in the limit this becomes a quadratic (in $n$) query complexity.

We next show how to construct more useful (in terms of query complexity) SRRA's for
LRPP, for arbitrarily small $\nu$.



4.3 Better SRRA for LRPP 

Consider the following idea for creating an $\varepsilon$-SRRA for LRPP, with respect to some fixed
$\pi \in \mathcal{C}$. We start by defining the following sample size parameter:

$$p = O(\varepsilon^{-3}\log^3 n) . \tag{8}$$

For all $u \in V$ and for all $i = 0, 1, \dots, \lceil\log n\rceil$, let $I_{u,i}$ denote the set of elements $v$ such that
$2^i p < |\pi(u) - \pi(v)| < 2^{i+1} p$ where, abusing notation, $\pi(u)$ is the position of $u$ in $\pi$. For
example, $\pi(u)$ is 1 if $u$ beats all other elements, and $n$ if it is beaten by all others. From
this set, choose a random sequence $R_{u,i} = (v_{u,i,1}, v_{u,i,2}, \dots, v_{u,i,p})$ of $p$ elements, each chosen
uniformly and independently from $I_{u,i}$.⁷
6. Due to symmetry, the uniform disagreement coefficient here equals $\theta_h$ for any $h \in \mathcal{C}$.

7. A variant of this sampling scheme is as follows: For each pair $(u, v)$, add it to $S$ with probability
proportional to $\min\{1, p/|\pi(u) - \pi(v)|\}$. A similar scheme can be found in (Ailon et al., 2007;
Halevy and Kushilevitz, 2007; Ailon, 2012) but the strong properties proven here were not known.

For distinct $u, v \in V$ and a permutation $\sigma \in \mathcal{C}$, let $C_{u,v}(\sigma)$ denote the contribution of
the pair $u, v$ to $\mathrm{er}_{\mathcal{D}}(\sigma)$, namely:

$$C_{u,v}(\sigma) = N^{-1}\,\mathbf{1}_{\sigma(u,v) \neq Y(u,v)} . \tag{9}$$

(Note that $C_{u,v} = C_{v,u}$.) Our estimator $f(\sigma)$ of $\mathrm{reg}_\pi(\sigma) = \mathrm{er}_{\mathcal{D}}(\sigma) - \mathrm{er}_{\mathcal{D}}(\pi)$ is defined as

$$f(\sigma) = \sum_{u\in V} \sum_{i=0}^{\lceil\log n\rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left(C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi)\right) . \tag{10}$$

Clearly, $f(\sigma)$ is an unbiased estimator of $\mathrm{reg}_\pi(\sigma)$ for any $\sigma$. Our goal is to prove that $f(\sigma)$
is an $\varepsilon$-SRRA.
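
A minimal Python sketch of this sampling scheme and of the estimator in (10) follows. It rests on assumptions that should be read as such: logarithms are taken base 2, the buckets are laid out so that they partition $V \setminus \{u\}$ (bucket 0 is taken to start at distance 1, which is one reading of the bucket boundaries and is what unbiasedness requires), and the function names and the `query` oracle are hypothetical.

```python
import math
import random

def build_lrpp_estimator(order_pi, p, query, rng):
    """Sample p elements per bucket I_{u,i}, query Y(u, v), and return (f, #queries)."""
    n = len(order_pi)
    big_n = n * (n - 1)                                   # N: number of ordered pairs
    pos_pi = {u: rank for rank, u in enumerate(order_pi, start=1)}
    num_buckets = math.ceil(math.log2(n)) + 1
    samples = []                                          # (u, |I_{u,i}|, [(v, Y(u, v)), ...])
    for u in order_pi:
        for i in range(num_buckets):
            low = 0 if i == 0 else 2 ** i * p             # assumption: bucket 0 covers the closest elements
            high = 2 ** (i + 1) * p
            bucket = [v for v in order_pi
                      if v != u and low <= abs(pos_pi[u] - pos_pi[v]) < high]
            if bucket:
                drawn = [rng.choice(bucket) for _ in range(p)]   # with repetitions
                samples.append((u, len(bucket), [(v, query(u, v)) for v in drawn]))

    def cost(pos, u, v, label):
        # C_{u,v}: 1/N when the ranking's verdict on (u, v) differs from Y(u, v)
        return (int(pos[u] < pos[v]) != label) / big_n

    def f(order_sigma):
        pos_sigma = {u: rank for rank, u in enumerate(order_sigma, start=1)}
        total = 0.0
        for u, bucket_size, drawn in samples:
            diff = sum(cost(pos_sigma, u, v, y) - cost(pos_pi, u, v, y) for v, y in drawn)
            total += bucket_size * diff / p
        return total

    return f, sum(len(drawn) for _, _, drawn in samples)

def prefers_b(u, v):
    """Hypothetical oracle on two alternatives: 'b' is preferred over 'a'."""
    return int(u == "b")

f_small, queries_small = build_lrpp_estimator(["a", "b"], 4, prefers_b, random.Random(0))
regret_of_fix = f_small(["b", "a"])       # estimate of er(sigma) - er(pi)

def prefers_smaller(u, v):
    """Hypothetical noiseless oracle: the smaller integer is preferred."""
    return int(u < v)

pi_large = list(range(16))
f_large, queries_large = build_lrpp_estimator(pi_large, 3, prefers_smaller, random.Random(1))
```

Labels are queried once, at sampling time, and reused for every candidate $\sigma$; evaluating `f` on further permutations costs no additional queries.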

Theorem 8 With probability at least l—n~^, the function f is an e-SRRA with respect to n. 

Proof The main idea is to decompose the difference $|f(\sigma) - \mathrm{reg}_\pi(\sigma)|$ vis-a-vis corresponding
pieces of $\mathrm{dist}(\sigma, \pi)$. The first half of the proof is devoted to the definition of such distance
"pieces." Then, using counting and standard deviation-bound arguments, we show that the
decomposition is, with high probability, an $\varepsilon$-SRRA.

We start with a few definitions. Abusing notation, for any $\pi \in \mathcal{C}$ and $u \in V$ let $\pi(u)$
denote the position of $u$ in the unique permutation that $\pi$ defines. For example, $\pi(u) = 1$
if $u$ beats all other alternatives: $\pi(u,v) = 1$ for all $v \neq u$. Similarly, $\pi(u) = n$ if $u$ is beaten
by all other alternatives. For any permutation $\sigma \in \mathcal{C}$, we define the corresponding profile
of $\sigma$ as the vector⁸

$$\mathrm{prof}(\sigma) = (\sigma(u_1) - \pi(u_1), \sigma(u_2) - \pi(u_2), \dots, \sigma(u_n) - \pi(u_n)) .$$

Note that $\|\mathrm{prof}(\sigma)\|_1$ is the well-known Spearman Footrule distance between $\sigma$ and $\pi$, which
we denote by $d_{SF}(\sigma)$ for brevity. For a subset $V'$ of $V$, we let $\mathrm{prof}(\sigma)[V']$ denote the
restriction of the vector $\mathrm{prof}(\sigma)$ to $V'$. Namely, the vector obtained by zeroing in $\mathrm{prof}(\sigma)$
all coordinates $v \notin V'$.

Now fix $\sigma \in \mathcal{C}$ and two distinct $u, v \in V$. Assume $u, v$ is an inversion in $\sigma$ with respect
to $\pi$, and that $|\pi(u) - \pi(v)| = b$ for some integer $b$. Then either $|\pi(u) - \sigma(u)| \ge b/2$ or
$|\pi(v) - \sigma(v)| \ge b/2$. We will "charge" the inversion to $\arg\max_{z\in\{u,v\}} \{|\pi(z) - \sigma(z)|\}$.⁹ For
any $u \in V$, let $\mathrm{charge}_\sigma(u)$ denote the set of elements $v \in V$ such that $(u,v)$ is an inversion
in $\sigma$ with respect to $\pi$, which is charged to $u$ based on the above rule. The function $\mathrm{reg}_\pi(\sigma)$
can now be written as

$$\mathrm{reg}_\pi(\sigma) = 2\sum_{u\in V}\ \sum_{v\in\mathrm{charge}_\sigma(u)} \left(C_{u,v}(\sigma) - C_{u,v}(\pi)\right) , \tag{11}$$

where we recall the definition of $C_{u,v}$ in Equation ([9]). Indeed, any pair that is not inverted
contributes nothing to the difference. From now on, we shall remove the subscript $\pi$, because
it is held fixed. Similarly, our estimator $f(\sigma)$ can be written as

$$f(\sigma) = 2\sum_{u\in V} \sum_{i=0}^{\lceil\log n\rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left(C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi)\right)\mathbf{1}_{v_{u,i,t}\in\mathrm{charge}_\sigma(u)} .$$
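
The profile, the Spearman Footrule distance and the charging rule can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, elements are assumed to be integers so that ties can be broken toward the greater element, and the final loop is a brute-force check, over all permutations of five elements, that the element charged with an inversion is displaced by at least $b/2$.

```python
from itertools import combinations, permutations

def positions(order):
    """Map each element to its 1-based position in the ranking."""
    return {u: rank for rank, u in enumerate(order, start=1)}

def profile(order_sigma, order_pi):
    """prof(sigma): per-element displacement sigma(u) - pi(u)."""
    pos_sigma, pos_pi = positions(order_sigma), positions(order_pi)
    return {u: pos_sigma[u] - pos_pi[u] for u in order_pi}

def footrule(order_sigma, order_pi):
    """Spearman Footrule distance: the l1 norm of the profile."""
    return sum(abs(d) for d in profile(order_sigma, order_pi).values())

def charge(order_sigma, order_pi):
    """Assign each inverted pair to its more displaced element."""
    pos_sigma, pos_pi = positions(order_sigma), positions(order_pi)
    prof = profile(order_sigma, order_pi)
    charged = {u: set() for u in order_pi}
    for u, v in combinations(order_pi, 2):
        if (pos_pi[u] < pos_pi[v]) != (pos_sigma[u] < pos_sigma[v]):   # inversion
            # ties are broken toward the greater element, as a canonical rule
            owner = max((u, v), key=lambda z: (abs(prof[z]), z))
            charged[owner].add(v if owner == u else u)
    return charged

pi = list(range(5))
pos_pi = positions(pi)
violations = 0
total_charged = 0
for sigma in permutations(pi):
    prof = profile(list(sigma), pi)
    for owner, others in charge(list(sigma), pi).items():
        for other in others:
            total_charged += 1
            b = abs(pos_pi[owner] - pos_pi[other])
            if 2 * abs(prof[owner]) < b:
                violations += 1
```

Each inverted unordered pair is charged exactly once, which is why the factor 2 appears when the sum over ordered pairs is rewritten as a sum over charged pairs.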

8. For the sake of definition assume an arbitrary indexing such that $V = \{u_i : i = 1, \dots, n\}$.

9. Breaking ties using some canonical rule, for example, charge to the greater of $u, v$ viewed as integers.

For any even integer $M$ let $U_{\sigma,M}$ denote the set of all elements $u$ such that

$$M/2 < |\pi(u) - \sigma(u)| \le M .$$

Let $U_{\sigma,\le M}$ denote:

$$U_{\sigma,\le M} = \bigcup_{M' \le M} U_{\sigma,M'} .$$

Now consider the following restrictions of $\mathrm{reg}(\sigma)$ and $f(\sigma)$:

$$\mathrm{reg}(\sigma, M) = 2\sum_{u\in U_{\sigma,M}}\ \sum_{v\in\mathrm{charge}_\sigma(u)} \left(C_{u,v}(\sigma) - C_{u,v}(\pi)\right) \tag{12}$$

$$f(\sigma, M) = 2\sum_{u\in U_{\sigma,M}} \sum_{i=0}^{\lceil\log n\rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left(C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi)\right)\mathbf{1}_{v_{u,i,t}\in\mathrm{charge}_\sigma(u)} \tag{13}$$

Clearly, $f(\sigma, M)$ is an unbiased estimator of $\mathrm{reg}(\sigma, M)$. Let $T_{\sigma,M}$ denote the set of all
elements $u \in V$ such that $|\pi(u) - \sigma(u)| < \varepsilon M$. We further split the expressions in ([12])-([13])
as follows:

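$$\mathrm{reg}(\sigma, M) = A(\sigma, M) + B(\sigma, M), \quad\text{and}\quad f(\sigma, M) = \hat A(\sigma, M) + \hat B(\sigma, M), \tag{14}$$

where

$$A(\sigma, M) = 2\sum_{u\in U_{\sigma,M}}\ \sum_{v\in\mathrm{charge}_\sigma(u)\cap\overline{T_{\sigma,M}}} \left(C_{u,v}(\sigma) - C_{u,v}(\pi)\right) \tag{15}$$

$$\hat A(\sigma, M) = 2\sum_{u\in U_{\sigma,M}} \sum_{i=0}^{\lceil\log n\rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left(C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi)\right)\mathbf{1}_{v_{u,i,t}\in\mathrm{charge}_\sigma(u)\cap\overline{T_{\sigma,M}}} \tag{16}$$

($\overline{(\cdot)}$ is set complement in $V$), and $B(\sigma, M)$, $\hat B(\sigma, M)$ are analogous with $T_{\sigma,M}$ instead of $\overline{T_{\sigma,M}}$,
as follows:

$$B(\sigma, M) = 2\sum_{u\in U_{\sigma,M}}\ \sum_{v\in\mathrm{charge}_\sigma(u)\cap T_{\sigma,M}} \left(C_{u,v}(\sigma) - C_{u,v}(\pi)\right) \tag{17}$$

$$\hat B(\sigma, M) = 2\sum_{u\in U_{\sigma,M}} \sum_{i=0}^{\lceil\log n\rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left(C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi)\right)\mathbf{1}_{v_{u,i,t}\in\mathrm{charge}_\sigma(u)\cap T_{\sigma,M}} \tag{18}$$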

We now estimate the deviation of $\hat A(\sigma, M)$ from $A(\sigma, M)$. Fix $M$. Notice that the expression
$A(\sigma, M)$ is completely determined by non-zero elements of the vector $\mathrm{prof}(\sigma)[U_{\sigma,\le M} \cap \overline{T_{\sigma,M}}]$.
Let $J_{\sigma,M}$ denote the number of nonzeros in $\mathrm{prof}(\sigma)[U_{\sigma,M}]$. Each nonzero coordinate
of $\mathrm{prof}(\sigma)[U_{\sigma,\le M} \cap \overline{T_{\sigma,M}}]$ is bounded below by $\varepsilon M$ in absolute value by definition. Let
$P(d, M)$ denote the number of possibilities for the vector $\mathrm{prof}(\sigma)[\overline{T_{\sigma,M}}]$ for $\sigma$ running over
all permutations satisfying $d_{SF}(\sigma) = d$. We claim that

$$P(d, M) \le n^{2d/(\varepsilon M)} . \tag{19}$$

Indeed, there can be at most $d/(\varepsilon M)$ nonzeros in $\mathrm{prof}(\sigma)[\overline{T_{\sigma,M}}]$, and each nonzero coordinate
can trivially take at most $n$ values. The bound follows.

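Now fix integers $d$ and $J$, and consider the subspace of permutations $\sigma$ such that $J_{\sigma,M} = J$ and $d_{SF}(\sigma) = d$. Define for each $u \in U_{\sigma,M}$, $i \in [\lceil \log n \rceil]$ and $t = 1,\dots,p$ a random variable $X_{u,i,t}$ as follows (this is the summand of (16)):

$$X_{u,i,t} = 2\,\frac{|I_{u,i}|}{p} \left( C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi) \right) 1_{v_{u,i,t} \in \mathrm{charge}_\sigma(u) \cap \overline{T_{\sigma,M}}} .$$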

Clearly A{a,M) = ^Y^^^jj^ Xu,i,t ■ For any u eV, let i„ = argmaxj < 4M, and 
observe that, by our charging scheme, j j = almost surely, for all i > iu and t = 1 . . .p. 
Also observe that for all u,i,t, \Xu,i,t\ < '^N^^\Iu,i\/p < 2^^^ jp almost surely. For a random 
variable X, we denote by H-'^lloo the infimum over numbers a such that X <ol almost surely. 
We conclude: 

E EEii^«,mIil< E Eiv-v^^+v/<c2P-^A^-VM2 

for some global $c_2 > 0$. (We used a bound on the sum of a geometric series.) Using the Hoeffding bound, we conclude that the probability that $\hat{A}(\sigma,M)$ deviates from its expected value of $A(\sigma,M)$ by more than some $s > 0$ is at most $\exp\{-s^2 p/(2 c_2 J M^2 N^{-2})\}$. We conclude that the probability that $\hat{A}(\sigma,M)$ deviates from its expected value by more than $\varepsilon d/(N \log n)$ is at most $\exp\{-c_1 \varepsilon^2 d^2 p/(J M^2 \log^2 n)\}$, for some global $c_1 > 0$. Hence,
by taking p = 0{e~^d^^M J log^ n), by union bounding over all P{d,M) possibilities for 
prof((T)[To-_A/], with probability at least 1 — n"'^ simultaneously for all a satisfying Ja,M = J 
and dspio') = d, 

$$|\hat{A}(\sigma,M) - A(\sigma,M)| \le \varepsilon d/(N \log n) . \tag{20}$$

But note that, trivially, JM < d, hence our choice of p in ([8]) is satisfactory. Finally, 
union bound over the 0(n^ log n) possibilities for the values of J and d and M = 1,2,4,.. 
to conclude that ([20]) holds for all permutations a simultaneously, with probability at least 
1 — n~'^. 

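Consider now $B(\sigma,M)$ and $\hat{B}(\sigma,M)$. We will need to further decompose these two expressions as follows. For $u \in U_{\sigma,M}$, we define a disjoint cover $(T^1_{u,\sigma,M}, T^2_{u,\sigma,M})$ of $\mathrm{charge}_\sigma(u) \cap T_{\sigma,M}$ as follows. If $\pi(u) < \sigma(u)$, then

$$T^1_{u,\sigma,M} = \{ v \in T_{\sigma,M} : \pi(u) + \varepsilon M < \pi(v) < \sigma(u) - \varepsilon M \} .$$

Otherwise,

$$T^1_{u,\sigma,M} = \{ v \in T_{\sigma,M} : \sigma(u) + \varepsilon M < \pi(v) < \pi(u) - \varepsilon M \} .$$

Note that by definition, $T^1_{u,\sigma,M} \subseteq \mathrm{charge}_\sigma(u)$. The set $T^2_{u,\sigma,M}$ is thus taken to be

$$T^2_{u,\sigma,M} = \left( \mathrm{charge}_\sigma(u) \cap T_{\sigma,M} \right) \setminus T^1_{u,\sigma,M} .$$

The expressions $B(\sigma,M)$, $\hat{B}(\sigma,M)$ now decompose as $B^1(\sigma,M) + B^2(\sigma,M)$ and $\hat{B}^1(\sigma,M) + \hat{B}^2(\sigma,M)$, respectively, as follows:

$$B^1(\sigma,M) = 2 \sum_{u \in U_{\sigma,M}} \sum_{v \in T^1_{u,\sigma,M}} \left( C_{u,v}(\sigma) - C_{u,v}(\pi) \right) \tag{21}$$

$$B^2(\sigma,M) = 2 \sum_{u \in U_{\sigma,M}} \sum_{v \in T^2_{u,\sigma,M}} \left( C_{u,v}(\sigma) - C_{u,v}(\pi) \right) \tag{22}$$

$$\hat{B}^1(\sigma,M) = 2 \sum_{u \in U_{\sigma,M}} \sum_{i=0}^{\lceil \log n \rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left( C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi) \right) 1_{v_{u,i,t} \in T^1_{u,\sigma,M}} \tag{23}$$

$$\hat{B}^2(\sigma,M) = 2 \sum_{u \in U_{\sigma,M}} \sum_{i=0}^{\lceil \log n \rceil} \frac{|I_{u,i}|}{p} \sum_{t=1}^{p} \left( C_{u,v_{u,i,t}}(\sigma) - C_{u,v_{u,i,t}}(\pi) \right) 1_{v_{u,i,t} \in T^2_{u,\sigma,M}} . \tag{24}$$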

Now notice that $B^1(\sigma,M)$ can be uniquely determined from $\mathrm{prof}(\sigma)[\overline{T_{\sigma,M}}]$. Indeed, in order to identify $T^1_{u,\sigma,M}$ for some $u \in U_{\sigma,M}$, it suffices to identify zeros in a subset of coordinates of $\mathrm{prof}(\sigma)[\overline{T_{\sigma,M}}]$, where the subset depends only on $\mathrm{prof}(\sigma)[\{u\}]$. Additionally, the value of $C_{u,v}(\sigma) - C_{u,v}(\pi)$ can be "read" from $\mathrm{prof}(\sigma)[\overline{T_{\sigma,M}}]$ (and, of course, $Y(u,v)$) if $v \in T^1_{u,\sigma,M}$. Hence, a Hoeffding bound and a union bound similar to the one used for bounding $|\hat{A}(\sigma,M) - A(\sigma,M)|$ can be used to bound (with high probability) $|\hat{B}^1(\sigma,M) - B^1(\sigma,M)|$ uniformly for all $\sigma$ and $M = 1,2,4,\dots$ as well.

Bounding $|\hat{B}^2(\sigma,M) - B^2(\sigma,M)|$ can be done using the following simple claim.

Claim 9 For $u \in V$ and an integer $q$, we say that the sampling is successful at $(u,q)$ if the random variable

$$\left| \{ (i,t) : \pi(v_{u,i,t}) \in [\pi(u) + (1-\varepsilon)q,\ \pi(u) + (1+\varepsilon)q] \cup [\pi(u) - (1+\varepsilon)q,\ \pi(u) - (1-\varepsilon)q] \} \right|$$

is at most twice its expected value. We say that the sampling is successful if it is successful at all $u \in V$ and $q \le n$. If the sampling is successful, then uniformly for all $\sigma$ and all $M = 1,2,4,\dots$,

$$|\hat{B}^2(\sigma,M) - B^2(\sigma,M)| = O(\varepsilon J_{\sigma,M} M / N) .$$
The sampling is successful with probability at least 1 — n"'^ if p = 0{e^'^ logn). 

The last assertion in the claim follows from Chernoff bounds. Note that our bound ([8]) on 
p is satisfactory, in virtue of the claim. 

Summing up the errors \A{a,M) — A{a,M)\, \B{a,M) — B{a,M)\ over all M gives us 
the following assertion: With probability at least 1 — n~^, uniformly for all cr, 

$$|f(\sigma) - \mathrm{reg}_\pi(\sigma)| \le \varepsilon\, d_{SF}(\sigma,\pi)/(2N) ,$$

where $d_{SF}(\sigma,\pi)$ is the Spearman Footrule distance between $\sigma$ and $\pi$. By Diaconis and Graham (1977), $\mathrm{dist}(\sigma,\pi)$ is at most twice $d_{SF}(\sigma,\pi)/N$. This concludes the proof.



Note that by the choice of $p$ in Equation (8), the number of preference queries required for computing $f$ is $O(\varepsilon^{-3} n \log^4 n)$. We conclude from Theorem 8, our bound on the number of preference queries, and Algorithm 1 defined in Corollary 4 the following:

Corollary 10 There exists an active learning algorithm for obtaining a solution $\pi \in \mathcal{C}$ for LRPP with $\mathrm{er}_{\mathcal{D}}(\pi) \le (1 + O(\varepsilon))\nu$ with total query complexity of $O(\varepsilon^{-3} n \log^5 n)$. The

algorithm succeeds with probability at least 1 — n~^. 

Corollary 10 allows us to find a solution of cost $(1+\varepsilon)\nu$ with query complexity that is slightly above linear in $n$ (for constant $\varepsilon$), regardless of the magnitude of $\nu$. In comparison, as we saw earlier, known active learning results (and in particular Corollary 6) that used disagreement coefficient and VC dimension bounds only guaranteed a query complexity of $\Omega(n\nu^{-1})$, tending to the pool size of $n(n-1)$ as $\nu$ becomes small. Note that $\nu = o(1)$ is quite realistic for this problem. For example, consider the following noise model. A ground truth permutation $\pi^*$ exists, $Y(u,v)$ is obtained as a human response to the question of preference between $u$ and $v$ with respect to $\pi^*$, and the human errs with probability proportional to $|\pi^*(u) - \pi^*(v)|^{-\rho}$. Namely, closer pairs of items in the ground truth permutation are more
prone to confuse a human labeler. The resulting noise is = for some p > 00 



5. Application #2: Clustering with Side Information



Clustering with side information is a fairly new variant of clustering first described, independently, by Demiriz et al. (1999) and Ben-Dor et al. (1999). In the machine learning community it is also widely known as semi-supervised clustering. There are a few alternatives for the form of feedback providing the side information. The most natural ones are the single item labels (e.g., Demiriz et al., 1999) and the pairwise constraints (e.g., Ben-Dor et al., 1999).

Here we consider pairwise side information: "must" / "cannot" link for pairs of elements $u,v \in V$. Each such information bit comes at a cost, and must be treated frugally. In a combinatorial optimization theoretical setting known as correlation clustering there is no input cost overhead, and similarity information for all (quadratically many) pairs is available. The goal there is to optimally clean the noise (nontransitivity). Correlation clustering was defined in (Bansal et al., 2004), and also in (Shamir et al., 2004) under the name cluster editing. Constant factor approximations are known for various minimization versions of this problem (Charikar and Wirth, 2004; Ailon et al., 2008). A PTAS is known for a minimization version in which the number of clusters is fixed to be $k$ (Giotis and Guruswami, 2006), as in our setting.

In machine learning, there are two main approaches for utilizing pairwise side information. In the first approach, this information is used to fine tune or learn a distance function, which is then passed on to any standard clustering algorithm such as $k$-means or $k$-medians (see, e.g., Klein et al., 2002; Xing et al., 2002; Cohn et al., 2000; Shamir and Tishby, 2011; Voevodski et al., 2012). The second approach, which is more related to our work, modifies the clustering algorithm's objective so as to incorporate the pairwise constraints (see, e.g., Basu, 2005; Eriksson et al., 2011). Basu (2005) in his thesis, which also serves as a comprehensive survey, has championed this approach in conjunction with $k$-means, and hidden Markov random field clustering algorithms. In our work we isolate the use of information coming from pairwise clustering constraints, and separate it from the geometry of the problem. In future work it would be interesting to analyze our framework in conjunction with the geometric structure of the input. Interestingly, Eriksson et al. (2011) study active learning for clustering using the geometric input structure. Unlike our setting, they assume either no noise or Bayesian noise.

10. Our work does not assume Bayesian noise, and we present this scenario for illustration purposes only.



5.1 Problem Definition 

Let $V$ be a set of points of size $n$. Our goal now is to partition $V$ into $k$ sets (clusters), where $k$ is fixed. In most applications, $V$ is endowed with some metric, and the practitioner uses this metric in order to evaluate the quality of a clustering solution. In some cases, known as semi-supervised clustering, or clustering with side information, additional information comes in the form of pairwise constraints. Such a constraint tells us for a pair $u,v \in V$ whether they should be in the same cluster or in separate ones. We concentrate on using such information.

Using the notation of our framework, $\mathcal{X}$ denotes the set of distinct pairs of elements in $V$ (same as in Section 4), and $\mathcal{D}_{\mathcal{X}}$ is the corresponding uniform measure. The label $Y((u,v)) = 1$ means that $u$ and $v$ should be clustered together, and $Y((u,v)) = 0$ means the opposite. Assume that $Y((u,v)) = Y((v,u))$ for all $u,v$.$^{11}$

The concept class $\mathcal{C}$ is the set of equivalence relations over $V$ with at most $k$ equivalence classes. More precisely, every $h \in \mathcal{C}$ is identified with a disjoint cover $V_1,\dots,V_k$ of $V$ (some $V_i$'s possibly empty), with $h((u,v)) = 1$ if and only if $u,v \in V_j$ for some $j$. As usual, $Y$ may induce a non-transitive relation (e.g., we could have $Y((u,v)) = Y((v,z)) = 1$ and $Y((u,z)) = 0$). In what follows, we will drop the double parentheses. Also, we will abuse notation by viewing $h$ as both an equivalence relation and as a disjoint cover $\{V_1,\dots,V_k\}$ of $V$. We take $\mathcal{D}$ to be the uniform measure on $\mathcal{X}$. The error of $h \in \mathcal{C}$ is given as $\mathrm{er}_{\mathcal{D}}(h) = N^{-1} \sum_{(u,v) \in \mathcal{X}} 1_{h(u,v) \neq Y(u,v)}$ where, as before, $N = |\mathcal{X}| = n(n-1)$. We will define $\mathrm{cost}_{u,v}(h)$ to be the contribution $N^{-1} 1_{h(u,v) \neq Y(u,v)}$ of $(u,v) \in \mathcal{X}$ to $\mathrm{er}_{\mathcal{D}}$. The distance $\mathrm{dist}(h,h')$ is given as $\mathrm{dist}(h,h') = N^{-1} \sum_{(u,v) \in \mathcal{X}} 1_{h(u,v) \neq h'(u,v)}$.
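The definitions above can be illustrated with a minimal Python sketch. It assumes a clustering is stored as a list of cluster labels (one per point) and that $Y$ is a dictionary over ordered pairs; the function names and the toy instance are illustrative choices, not part of the original text.

```python
from itertools import permutations

def same_cluster(labels, u, v):
    """h(u, v) = 1 iff u and v carry the same cluster label."""
    return 1 if labels[u] == labels[v] else 0

def error(labels, Y):
    """er_D(h): fraction of ordered distinct pairs on which h disagrees with Y."""
    n = len(labels)
    pairs = list(permutations(range(n), 2))  # N = n(n-1) ordered pairs
    return sum(same_cluster(labels, u, v) != Y[(u, v)] for u, v in pairs) / len(pairs)

def dist(labels_a, labels_b):
    """dist(h, h'): fraction of ordered distinct pairs on which h and h' disagree."""
    n = len(labels_a)
    pairs = list(permutations(range(n), 2))
    return sum(same_cluster(labels_a, u, v) != same_cluster(labels_b, u, v)
               for u, v in pairs) / len(pairs)

# Toy instance with a non-transitive Y: 0~1, 1~2, 2~3 are "must link", all else "cannot link".
n = 4
Y = {(u, v): 0 for u, v in permutations(range(n), 2)}
for u, v in [(0, 1), (1, 2), (2, 3)]:
    Y[(u, v)] = Y[(v, u)] = 1  # symmetric labels

h = [0, 0, 1, 1]        # two clusters {0,1} and {2,3}
h_single = [0, 0, 0, 0]  # everything in one cluster

print(error(h, Y), error(h_single, Y), dist(h, h_single))
```

Because $Y$ is not transitive here, no clustering attains zero error, which is the situation the section is concerned with.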

5.2 The Ineffectiveness of Using Disagreement Coefficient Arguments 

We check again what disagreement coefficient based arguments can contribute to this problem. It is easy to see that the uniform disagreement coefficient of $\mathcal{C}$ is $\Theta(n)$. Indeed, starting from any solution $h \in \mathcal{C}$ with corresponding partitioning $\{V_1,\dots,V_k\}$, consider the partition obtained by moving an element $u \in V$ from its current part $V_j$ to some other part $V_{j'}$ for $j' \neq j$. In other words, consider the clustering $h' \in \mathcal{C}$ given by $\{V_{j'} \cup \{u\}, V_j \setminus \{u\}\} \cup \bigcup_{i \notin \{j,j'\}} \{V_i\}$. Observe that $\mathrm{dist}(h,h') = O(1/n)$. On the other hand, for any $v \in V$ and for any $u \in V$ there is a choice of $j'$ so that $h$ and $h'$ obtained as above would disagree on $(u,v) \in \mathcal{X}$. Hence, $\Pr_{X \sim \mathcal{D}_{\mathcal{X}}}[X \in \mathrm{DIS}(B(h, O(1/n)))] = 1$.

It is also not hard to see that the VC dimension of $\mathcal{C}$ is $\Theta(n)$. Indeed, any full matching over $V$ constitutes a set which is shattered in $\mathcal{C}$ (as long as $k \ge 2$, of course). On the other hand, any set $S \subseteq \mathcal{X}$ of size $n$ must induce an undirected cycle on the elements of $V$. Clearly the edges of a cycle cannot be shattered by functions in $\mathcal{C}$, because if $h(u_1,u_2) = h(u_2,u_3) = \cdots = h(u_{\ell-1},u_\ell) = 1$ for $h \in \mathcal{C}$, then also $h(u_1,u_\ell) = 1$.
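The single-element move argument can be checked numerically with a small Python sketch. It assumes clusterings are label lists and $k \ge 2$; the helper names are hypothetical. Moving one element changes at most $2(n-1)$ of the $n(n-1)$ ordered pairs, so the distance is at most $2/n$, while every pair can be made to disagree by some such move.

```python
from itertools import permutations

def dist(labels_a, labels_b):
    """Fraction of ordered distinct pairs on which two clusterings disagree."""
    n = len(labels_a)
    pairs = list(permutations(range(n), 2))
    return sum((labels_a[u] == labels_a[v]) != (labels_b[u] == labels_b[v])
               for u, v in pairs) / len(pairs)

def move_element(labels, u, target):
    """Return a copy of the clustering with element u moved to cluster `target`."""
    moved = list(labels)
    moved[u] = target
    return moved

def in_disagreement_region(labels, k, u, v):
    """True if some single-element move of u yields h' with h'(u,v) != h(u,v)."""
    for target in range(k):
        if target == labels[u]:
            continue
        moved = move_element(labels, u, target)
        if (moved[u] == moved[v]) != (labels[u] == labels[v]):
            return True
    return False

n, k = 12, 3
h = [u % k for u in range(n)]      # three clusters of four elements each
h_moved = move_element(h, 0, 1)    # move element 0 from cluster 0 to cluster 1

print(dist(h, h_moved), 2 / n)
print(all(in_disagreement_region(h, k, u, v) for u, v in permutations(range(n), 2)))
```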

11. Equivalently, assume that X contains only unordered distinct pairs without any constraint on Y . For 
notational purposes we preferred to define X as the set of ordered distinct pairs. 
Using Corollary 6 we conclude that we would need $\Omega(n^2)$ preference labels to obtain an $(\varepsilon,\mu)$-SRRA for any meaningful pair $(\varepsilon,\mu)$. This is useless because the cardinality of $\mathcal{X}$
is O(n^). As in the problem discussed in Section [U this can be improved using Remark S] 
to $O(n\nu^{-1})$, which tends to quadratic in $n$ as $\nu$ becomes smaller. We next show how to construct more useful SRRA's for the problem, for arbitrarily small $\nu$.

5.3 Better SRRA for Semi-Supervised k-Clustering

Fix $h \in \mathcal{C}$, with $h = \{V_1,\dots,V_k\}$ (we allow empty $V_i$'s). Order the $V_i$'s so that $|V_1| \ge |V_2| \ge \cdots \ge |V_k|$. We construct an $\varepsilon$-SRRA with respect to $h$ as follows. For each cluster $V_i \in h$ and for each element $u \in V_i$ we draw $k - i + 1$ independent samples $S_{ui}, S_{u(i+1)},\dots,S_{uk}$ as follows. Each sample $S_{uj}$ is a subset of $V_j$ of size $q$ (to be defined below), chosen uniformly with repetitions from $V_j$, where

$$q = c_2 \max\{\varepsilon^{-2}k^2, \varepsilon^{-3}k\} \log n \tag{25}$$

for some global $c_2 > 0$. Note that the collection of pairs $\{(u,v) \in \mathcal{X} : v \in S_{ui} \text{ for some } i\}$ is, roughly speaking, biased in such a way that pairs containing elements in smaller clusters (with respect to $h$) are more likely to be selected.
We define our estimator $f$ to be, for any $h' \in \mathcal{C}$,

$$f(h') = \sum_{i=1}^{k} \frac{|V_i|}{q} \sum_{u \in V_i} \sum_{v \in S_{ui}} f_{u,v}(h') + 2 \sum_{i=1}^{k} \sum_{u \in V_i} \sum_{j=i+1}^{k} \frac{|V_j|}{q} \sum_{v \in S_{uj}} f_{u,v}(h') , \tag{26}$$

where $f_{u,v}(h') = \mathrm{cost}_{u,v}(h') - \mathrm{cost}_{u,v}(h)$ and $\mathrm{cost}_{u,v}(h) = N^{-1} 1_{h(u,v) \neq Y(u,v)}$. Note that the summations over $S_{ui}$ above take into account multiplicity of elements in the multiset $S_{ui}$.
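The sampling scheme just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: clusterings are label lists, $Y$ is a symmetric dictionary over ordered pairs, a sampled pair $(u,u)$ is treated as contributing zero (it is not in $\mathcal{X}$), and `draw_samples`, `estimate_regret` and the toy data are hypothetical names and values. Each sample carries the weight $|V_j|/q$ (doubled for cross-cluster samples), which makes the estimate unbiased for the exact relative regret.

```python
import random
from itertools import permutations

def cost(labels, Y, u, v, N):
    """cost_{u,v}(h) = N^{-1} * 1[h(u,v) != Y(u,v)]."""
    same = 1 if labels[u] == labels[v] else 0
    return (1.0 / N) if same != Y[(u, v)] else 0.0

def exact_regret(h, h_prime, Y):
    """reg_h(h') = er(h') - er(h), summed over all ordered distinct pairs."""
    n = len(h)
    N = n * (n - 1)
    return sum(cost(h_prime, Y, u, v, N) - cost(h, Y, u, v, N)
               for u, v in permutations(range(n), 2))

def draw_samples(h, k, q, rng):
    """For u in V_i and each j >= i, draw q elements of V_j uniformly with repetition."""
    n = len(h)
    clusters = sorted(([u for u in range(n) if h[u] == c] for c in range(k)),
                      key=len, reverse=True)  # |V_1| >= ... >= |V_k|
    samples = []
    for i, V_i in enumerate(clusters):
        for u in V_i:
            for j in range(i, k):
                V_j = clusters[j]
                if not V_j:
                    continue
                # Within-cluster samples get weight |V_i|/q, cross-cluster ones 2|V_j|/q.
                weight = (1.0 if j == i else 2.0) * len(V_j) / q
                for _ in range(q):
                    samples.append((u, rng.choice(V_j), weight))
    return samples

def estimate_regret(samples, h, h_prime, Y):
    """Weighted sum of f_{u,v}(h') over the sampled pairs; only those labels are read."""
    n = len(h)
    N = n * (n - 1)
    total = 0.0
    for u, v, weight in samples:
        if u == v:
            continue  # (u, u) is not a pair in X
        total += weight * (cost(h_prime, Y, u, v, N) - cost(h, Y, u, v, N))
    return total

rng = random.Random(0)
n, k, q = 12, 3, 5000
truth = [u % k for u in range(n)]
Y = {}
for u in range(n):
    for v in range(u + 1, n):
        label = 1 if truth[u] == truth[v] else 0
        if rng.random() < 0.1:   # a little label noise
            label = 1 - label
        Y[(u, v)] = Y[(v, u)] = label

h = [0] * 6 + [1] * 4 + [2] * 2   # pivot clustering with unequal cluster sizes
samples = draw_samples(h, k, q, rng)
print(exact_regret(h, truth, Y), estimate_regret(samples, h, truth, Y))
```

The same sample set is reused for every candidate $h'$, which is what allows a single batch of label queries to serve the whole minimization step. The value of `q` here is chosen for a stable demonstration, not according to (25).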

Theorem 11 With probability at least 1 — the function f is an e-SRRA with respect 
to h. 

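Consider another $k$-clustering $h' \in \mathcal{C}$, with corresponding partitioning $\{V'_1,\dots,V'_k\}$ of $V$. We can write $\mathrm{dist}(h,h')$ as

$$\mathrm{dist}(h,h') = \sum_{(u,v) \in \mathcal{X}} \mathrm{dist}_{u,v}(h,h')$$

where $\mathrm{dist}_{u,v}(h,h') = N^{-1}\left(1_{h'(u,v)=1} 1_{h(u,v)=0} + 1_{h(u,v)=1} 1_{h'(u,v)=0}\right)$.

Let $n_i$ denote $|V_i|$, and recall that $n_1 \ge n_2 \ge \cdots \ge n_k$. In what follows, we remove the subscript in $\mathrm{reg}_h$ and rename it $\mathrm{reg}$ ($h$ is held fixed). The function $\mathrm{reg}(h')$ will now be written as:

$$\mathrm{reg}(h') = \sum_{i=1}^{k} \sum_{u \in V_i} \left( \sum_{v \in V_i \setminus \{u\}} \mathrm{reg}_{u,v}(h') + 2 \sum_{j=i+1}^{k} \sum_{v \in V_j} \mathrm{reg}_{u,v}(h') \right) \tag{27}$$

where

$$\mathrm{reg}_{u,v}(h') = \mathrm{cost}_{u,v}(h') - \mathrm{cost}_{u,v}(h) .$$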
Clearly for each $h'$ it holds that $f(h')$ from (26) is an unbiased estimator of $\mathrm{reg}(h')$. We now analyze its error. For each $i,j \in [k]$ let $V_{ij}$ denote $V_i \cap V'_j$. This captures exactly the set of elements in the $i$'th cluster in $h$ and the $j$'th cluster in $h'$. The distance $\mathrm{dist}(h,h')$ can be written as follows:

$$\mathrm{dist}(h,h') = N^{-1} \left( \sum_{i=1}^{k} \sum_{j=1}^{k} |V_{ij} \times (V_i \setminus V_{ij})| + 2 \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} |V_{i_1 j} \times V_{i_2 j}| \right) . \tag{28}$$
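As a sanity check, the rectangle decomposition of the distance can be compared against the direct pair count. The Python sketch below assumes clusterings are label lists with labels in $\{0,\dots,k-1\}$; the function names are illustrative. The first sum counts ordered pairs that $h$ joins and $h'$ separates, and the second counts (twice, once per order) pairs that $h$ separates and $h'$ joins.

```python
import random
from itertools import permutations

def dist_direct(h, h_prime):
    """N^{-1} * number of ordered distinct pairs on which h and h' disagree."""
    n = len(h)
    pairs = list(permutations(range(n), 2))
    return sum((h[u] == h[v]) != (h_prime[u] == h_prime[v]) for u, v in pairs) / len(pairs)

def dist_rectangles(h, h_prime, k):
    """Same quantity computed from the sizes |V_ij| = |V_i ∩ V'_j| only."""
    n = len(h)
    N = n * (n - 1)
    size = [[0] * k for _ in range(k)]
    for u in range(n):
        size[h[u]][h_prime[u]] += 1
    row = [sum(size[i]) for i in range(k)]  # |V_i|
    # Pairs joined by h but split by h': rectangles V_ij x (V_i \ V_ij).
    within = sum(size[i][j] * (row[i] - size[i][j]) for i in range(k) for j in range(k))
    # Pairs split by h but joined by h': rectangles V_{i1 j} x V_{i2 j}, i1 < i2.
    across = sum(size[i1][j] * size[i2][j]
                 for j in range(k) for i1 in range(k) for i2 in range(i1 + 1, k))
    return (within + 2 * across) / N

rng = random.Random(1)
n, k = 15, 4
h = [rng.randrange(k) for _ in range(n)]
h_prime = [rng.randrange(k) for _ in range(n)]
print(dist_direct(h, h_prime), dist_rectangles(h, h_prime, k))
```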

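We call each cartesian set product in (28) a distance contributing rectangle. Note that unless a pair $(u,v)$ appears in one of the distance contributing rectangles, we have $\mathrm{reg}_{u,v}(h') = f_{u,v}(h') = 0$. Hence we can decompose $\mathrm{reg}(h')$ and $f(h')$ in correspondence with the distance contributing rectangles, as follows:

$$\mathrm{reg}(h') = \sum_{i=1}^{k} \sum_{j=1}^{k} G_{i,j}(h') + 2 \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} G_{i_1,i_2,j}(h') \tag{29}$$

$$f(h') = \sum_{i=1}^{k} \sum_{j=1}^{k} F_{i,j}(h') + 2 \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} F_{i_1,i_2,j}(h') \tag{30}$$

where

$$G_{i,j}(h') = \sum_{u \in V_{ij}} \sum_{v \in V_i \setminus V_{ij}} \mathrm{reg}_{u,v}(h') \tag{31}$$

$$F_{i,j}(h') = \frac{|V_i|}{q} \sum_{u \in V_{ij}} \sum_{v \in (V_i \setminus V_{ij}) \cap S_{ui}} f_{u,v}(h') \tag{32}$$

$$G_{i_1,i_2,j}(h') = \sum_{u \in V_{i_1 j}} \sum_{v \in V_{i_2 j}} \mathrm{reg}_{u,v}(h') \tag{33}$$

$$F_{i_1,i_2,j}(h') = \frac{|V_{i_2}|}{q} \sum_{u \in V_{i_1 j}} \sum_{v \in V_{i_2 j} \cap S_{u i_2}} f_{u,v}(h') \tag{34}$$

(Note that the $S_{ui}$'s are multisets, and the inner sums in (32) and (34) may count elements multiple times.)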

Lemma 12 With probability at least 1 — n~^, the following holds simultaneously for all 
$h' \in \mathcal{C}$ and all $i,j \in [k]$:

$$|G_{i,j}(h') - F_{i,j}(h')| \le \varepsilon N^{-1} \cdot |V_{ij} \times (V_i \setminus V_{ij})| . \tag{35}$$

Proof The predicate (35) (for a given $i,j$) depends only on the set $V_{ij} = V_i \cap V'_j$. Given a subset $B \subseteq V_i$, we say that $h'$ $(i,j)$-realizes $B$ if $V_{ij} = B$.

Now fix $i,j$ and $B \subseteq V_i$. Assume $h'$ $(i,j)$-realizes $B$. Let $\beta = |B|$ and $\gamma = |V_i|$. Consider the random variable $F_{i,j}(h')$ (see (32)). Think of the sample $S_{ui}$ as a sequence $S_{ui}(1),\dots,S_{ui}(q)$, where each $S_{ui}(s)$ is chosen uniformly at random from $V_i$ for $s = 1,\dots,q$. We can now rewrite $F_{i,j}(h')$ as follows:

$$F_{i,j}(h') = \frac{\gamma}{q} \sum_{u \in B} \sum_{s=1}^{q} Z(S_{ui}(s)) ,$$

where

$$Z(v) = \begin{cases} f_{u,v}(h') & v \in V_i \setminus V_{ij} \\ 0 & \text{otherwise.} \end{cases}$$

For all $s = 1,\dots,q$ the random variable $Z(S_{ui}(s))$ is bounded by $2N^{-1}$ almost surely, and its moments satisfy:

$$E[Z(S_{ui}(s))] = \frac{1}{\gamma} \sum_{v \in (V_i \setminus V_{ij})} f_{u,v}(h') , \qquad E[Z(S_{ui}(s))^2] \le \frac{4N^{-2}(\gamma - \beta)}{\gamma} . \tag{36}$$

From this we conclude using Bernstein inequality that for any $t \le N^{-1}\beta(\gamma - \beta)$,

$$\Pr[|F_{i,j}(h') - G_{i,j}(h')| \ge t] \le \exp\left\{ -\frac{q t^2}{16 \gamma \beta (\gamma - \beta) N^{-2}} \right\} .$$

Plugging in $t = \varepsilon N^{-1} \beta(\gamma - \beta)$, we conclude

$$\Pr[|F_{i,j}(h') - G_{i,j}(h')| \ge \varepsilon N^{-1} \beta(\gamma - \beta)] \le \exp\left\{ -\frac{q \varepsilon^2 \beta (\gamma - \beta)}{16 \gamma} \right\} .$$

Now note that the number of possible sets $B \subseteq V_i$ of size $\beta$ is at most $n^{\min\{\beta, \gamma - \beta\}}$. Using union bound and recalling our choice of $q$, the lemma follows.



Proving the following is more involved. 

Lemma 13 With probability at least 1 — , the following holds uniformly for all h' G C 
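and for all $i_1,i_2,j \in [k]$ with $i_1 < i_2$:

$$|G_{i_1,i_2,j}(h') - F_{i_1,i_2,j}(h')| \le \varepsilon N^{-1} \max\left\{ |V_{i_1 j} \times V_{i_2 j}|,\ \frac{|V_{i_1 j} \times (V_{i_1} \setminus V_{i_1 j})|}{k},\ \frac{|V_{i_2 j} \times (V_{i_2} \setminus V_{i_2 j})|}{k} \right\} \tag{37}$$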

Proof The predicate (37) (for a given $i_1,i_2,j$) depends only on the sets $V_{i_1 j} = V_{i_1} \cap V'_j$ and $V_{i_2 j} = V_{i_2} \cap V'_j$. Given subsets $B_1 \subseteq V_{i_1}$ and $B_2 \subseteq V_{i_2}$, we say that $h'$ $(i_1,i_2,j)$-realizes $(B_1,B_2)$ if $V_{i_1 j} = B_1$ and $V_{i_2 j} = B_2$.

We now fix $i_1 < i_2$, $j$ and $B_1 \subseteq V_{i_1}$, $B_2 \subseteq V_{i_2}$. Assume $h'$ $(i_1,i_2,j)$-realizes $(B_1,B_2)$. For brevity, denote $\beta_\iota = |B_\iota|$ and $\gamma_\iota = |V_{i_\iota}|$ for $\iota = 1,2$. Using Bernstein inequality as in Lemma 12 we conclude that

$$\Pr[|G_{i_1,i_2,j}(h') - F_{i_1,i_2,j}(h')| > t] \le \exp\left\{ -\frac{c_3 t^2 q N^2}{\beta_1 \beta_2 \gamma_2} \right\} \tag{38}$$

for any $t$ in the range $[0, N^{-1}\beta_1\beta_2]$, for some global $c_3 > 0$. For $t$ in the range $(N^{-1}\beta_1\beta_2, \infty)$,

$$\Pr[|G_{i_1,i_2,j}(h') - F_{i_1,i_2,j}(h')| > t] \le \exp\left\{ -\frac{c_4 t q N}{\gamma_2} \right\} \tag{39}$$

for some global $c_4$. We consider the following three cases.

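1. $\beta_1\beta_2 \ge \max\{\beta_1(\gamma_1 - \beta_1)/k,\ \beta_2(\gamma_2 - \beta_2)/k\}$. Hence, $\beta_1 \ge (\gamma_2 - \beta_2)/k$, $\beta_2 \ge (\gamma_1 - \beta_1)/k$. In this case, we can plug in (38) to get

$$\Pr[|G_{i_1,i_2,j}(h') - F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_1\beta_2] \le \exp\left\{ -\frac{c_3 \varepsilon^2 \beta_1 \beta_2 q}{\gamma_2} \right\} . \tag{40}$$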

Consider two subcases, (i) If /32 > 72/2 then the RHS of (|10]) is at most exp | — ^^^^2^}- 
The number of possible subsets Bi,B2 of sizes /3i,/32 respectively is clearly at most 
^/3i+(72-/32) < j^Pi+kPi _ Therefore, as long as g = 0(e~^A;logn) (which is satisfied 
by our choice ()25p ). then with probability at least 1 — this case is taken care of 
in the following sense: Simultaneously for all j, ii < Z2, all possible /3i < 71 = jV^J, 
/S2 < 72 = l^ijl satisfying the assumptions and for all Bi C Vi^j,B2 C Vi^j of sizes 
/?!, /32 respectively and for all h' (ii, ^2, j)-realizing {Bi,B2) we have that |Gii,i2 j(^') — 
-^n,«2,j(^')l ^ ^l^ih ■ (ii) If P2 < 72/2 then by our assumption, /3i > 72/2A;. Hence 
the RHS of (I40p is at most exp The number of sets B2 of sizes /3i, /32 

respectively is clearly at most n'^'^i"^^^"'"^^ < ^^feCi+fc) (]-,y q^j^ assumptions and since 
clearly 72 < /32). Therefore, as long as g = 0{e^'^k'^\ogn) (satisfied by our choice), 
then with probability at least 1 — this case is taken care of in the following sense: 
Simultaneously for all j, ii < 12, all possible /3i < 71 = [, /32 < 72 = I 1 satisfying 
the assumptions and for all Bi C Vi-^j,B2 C Vi^j of sizes /3i,/32 respectively and for 
all h' (zi, ?2 5 jl-realizing {Bi,B2) we have that IGj^^jj j(/i') — Fi-^^,^2,jih')\ < ef^ih ■ 

2. $\beta_2(\gamma_2 - \beta_2)/k \ge \max\{\beta_1\beta_2,\ \beta_1(\gamma_1 - \beta_1)/k\}$. We consider two subcases.

(a) e/32 (72 - l32)/k < Pip2- Using ([38]), we get 

FT[\G,,,,,ih')-F,,,,,{h')\ > 5iV-i/32(72-/52)A] < «^p{-^^^^^|l|^^ 

(41) 

Again consider two subcases, (i) /32 < 72/2. In this case we conclude from (|4ip 



Pr[|G,,,,„,(/i') - F,,,„,(/i')| > eiV-V2(72 - /32)/A:] < exp j- '^^g^'^ | (42) 

Now note that by assumption 

/3i < (72 - ^2)/k < 72/fc < ji/k . (43) 
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Also by assumption, 

Pi < /32(72 - /32)/(7i - < /3272/(7i - • (44) 

Plugging ([43|) in the RHS of ([Ml), we conclude that /3i < /3272/(7i(l - 1/^)) < 
2/3272/71 < 2/^2 (the last step was in virtue of our assumption 71 < 72). From 
here we conclude that the RHS of ()i2]) is at most exp | — '^^^^^^^'^ | . The 

number of sets Bi,B2 of sizes /3i,/32 respectively is clearly at most n^'^'^i^^ < 
^2/32+/32 < fi3'y2_ Hence, as long as q = 0(e~^fe^logn) (satisfied by our as- 
sumption), with probability at least 1 — simultaneously for all < i2, all 
possible /3i < 7i = |ViJ, /32 < 72 = {Vi^l satisfying the assumptions and for all 
Bi Q ^ij,-B2 ^ ^2j of sizes /3i,/32 respectively and for all h' (ii, ^2, J^-realizing 
{Bi, B2) we have that IGj^^jj j(/i') — -^ii,j2 j(^')l ^ £/52(72 — ■ In the second 

subcase (ii) ^2 > 72/2. The RHS of gl]) is at most exp {-^^^^^f^}- By our 
assumption, (72 — l32)/{k(3i) > 1, hence this is at most exp | — ^^^^-^^p^^ | • The 

number of sets Bi, B2 of sizes (3i,(32 respectively is clearly at most n^'^'^i'y^"'/^^) < 
^(72-/32)/fc+(72-/32) < „2(72-/32)_ Therefore, as long as q = ©(e'^fclogn) (satis- 
fied by our assumption), then with probability at least 1 — n~^, using a similar 
counting and union bound argument as above, this case is taken care of in the 
sense that: {Gi^^i^jih') - Gij,j2,i(^')l - ^-^2(72 - P2)/k . 
(b) e/32 (72 - /32)/fe > /3i/32. We now use ([39]) to conclude 

PT[\G,,,i,,,{h') - F,,,,„,(/i')| > eiV-V2(72 - (32)/k] < exp | - ^^^^^^^^^^ 

(45) 

We again consider the cases (i) /32 < 72/2 and (ii) /32 > 72/2 as above. In (i), 
we get that the RHS of ([^5]) is at most exp | — | . Now notice that by our 
assumptions, 

/3i < e(72 - P2)/k < 72/2 < 71/2 . (46) 

Also by our assumptions, f3i < /32(72 — /32)/(7i — /3i), which by (|16|) is at most 
2/3272/71 < 2/32- Hence the number of possibilities for i?2 is at most n'^i+'^z < 

n^^K In (ii), we get that the RHS of (gS]) is at most exp | - } , and 

the number of possibilities for Bi,B2 is at most n^''-~^("/2~l^2) which is bounded by 
j^2(72-/32) i-,y assumptions. For both (i) and (ii) taking q = 0{e^^klogn) en- 
sures with probability at least 1 — , using a similar counting and union bound- 
ing argument as above, case (b) is taken care of in the SGI1S6 ths^tl \^ii,i2jj i.^ ) — 
Fi,,i2,j{h')\<eN-^(32{i2-(32)/k . 

3. $\beta_1(\gamma_1 - \beta_1)/k \ge \max\{\beta_1\beta_2,\ \beta_2(\gamma_2 - \beta_2)/k\}$. We consider two subcases.

(a) e/3i(7i - f3i)/k < /3i/32. Using we get 

Pr[\G,,,,,,{h')-F,,,,,,ih')\ > £iV-Vi(7i-/3i)/fc] < exp j- ^^^^^'^^J^^^^" ^^^'^ | . 

(47) 
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As before, consider case (i) in which /32 < 72/2 and (ii) in which /32 > 72/2. For 
case (i), we notice that the RHS of gTD is at most exp | - ^2£!M2|^^gli^^ | 
(we used the fact that /?i(7i — /3i) > /32(72 — P2) by assumption). This is hence 
at most exp | — ^^^^-^^p-^^ | • The number of possibihties of -61,-62 of sizes /3i, /32 
is clearly at most < n^^i-MHu-M/k < ^2(71-/31)^ Ytoui this we 

conclude that q = 0{e''^k'^ log n) suffices for this case. For case (ii), we bound the 
RHS of glD by exp | - ^^'^^^jj,'^/'^^'' } • Using the assumption that (71 - /?i ) //32 > 

k, the latter expression is upper bounded by exp | — ^^^^2^ | • Again by our 
assumptions, 

/3i > /32(72 - /?2)/(7i - /5i) > (e(7i - Pi)/k)h2 - - Pi) = ^(72 - P2)/k . 

(48) 

The number of possibilities of Bi, B2 of sizes /3i, is clearly at most n/^^+i'y^ /^a) 
which by (I48p is bounded by n^'^+^l^t-/^ < n^'^^^/^. ^From this we conclude that 
as long as g = 0(e~'^A;logn) (satisfied by our choice), this case is taken care of 
in sense repeatedly explained above. 

e/3i(7i - f3i)/k > /3i/32. Using ([39]), we get 

Pi[\G,,,i,,,{h') - F,,,,„,(/i')| > eiV-Vi(7i - Pi)/k] < | _ ^4£/?i (71 - A ) 

(49) 

We consider two sub-cases, (i) /3i < 71/2 and (ii) /3i > 71/2. In case (i), we have 
that 

Piiii-M ^ 1 Mil -Pi) ^ i /?i(7i-/?i) 
72 2 72 2 72 

> l/3i7i 1 /32(72-/32) 
- 2 272 2 72 

>/3i/4 + min{/32,72-/32}/2 . 
(The last step used 72 > 71.) Hence, the RHS of (fl9]l is bounded above by 



exp 



C4£g(/3i/4 + min{/32,72 - P2}/2) \ 

k / • 



The number of possibilities of -Bi, -B2 of sizes Pi, P2 is clearly at most j7,/'i+™i'^{/52,72- 
hence as long as g = 0(e~^A;logn) (satisfied by our choice), this case is taken 
care of in the sense repeatedly explained above. In case (ii), we can upper bound 
the RHS of m by exp > exp |-£4£(3|^}. The number of 

possibilities of -81,-62 of sizes Pi,P2 is clearly at most n^'y'i-~M+P2 -which, using 
our assumptions, is bounded above by n^'''^^'^^^^*-'^^"^^^/'^ < n'^^^'^~^''-\ Hence, 
as long as g = 0(e~^fclogn), this case is taken care of in the sense repeatedly 
explained above. 
This concludes the proof of the lemma. ■
As a consequence, we get the following:

Lemma 14 with probability at least 1 — n^^, the following holds simultaneously for all 
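$k$-clusterings $h' \in \mathcal{C}$:

$$|\mathrm{reg}(h') - f(h')| \le 5\varepsilon\, \mathrm{dist}(h',h) .$$

Proof By the triangle inequality,

$$|\mathrm{reg}(h') - f(h')| \le \sum_{i=1}^{k} \sum_{j=1}^{k} |G_{i,j}(h') - F_{i,j}(h')| + 2 \sum_{j=1}^{k} \sum_{1 \le i_1 < i_2 \le k} |G_{i_1,i_2,j}(h') - F_{i_1,i_2,j}(h')| . \tag{50}$$

Using (29)-(30), then Lemmata 12-13 (assuming success of a high probability event), rearranging the sum and finally using (28), we get:

$$\begin{aligned}
|\mathrm{reg}(h') - f(h')| &\le \sum_{i=1}^{k} \sum_{j=1}^{k} \varepsilon N^{-1} |V_{ij} \times (V_i \setminus V_{ij})| \\
&\quad + 2\varepsilon N^{-1} \sum_{j=1}^{k} \sum_{i_1 < i_2} \left( |V_{i_1 j} \times V_{i_2 j}| + k^{-1} |V_{i_1 j} \times (V_{i_1} \setminus V_{i_1 j})| + k^{-1} |V_{i_2 j} \times (V_{i_2} \setminus V_{i_2 j})| \right) \\
&\le \varepsilon N^{-1} \sum_{i=1}^{k} \sum_{j=1}^{k} |V_{ij} \times (V_i \setminus V_{ij})| + 2\varepsilon N^{-1} \sum_{j=1}^{k} \sum_{i_1 < i_2} |V_{i_1 j} \times V_{i_2 j}| \\
&\quad + 2\varepsilon N^{-1} \sum_{j=1}^{k} \sum_{i_1=1}^{k} |V_{i_1 j} \times (V_{i_1} \setminus V_{i_1 j})| + 2\varepsilon N^{-1} \sum_{j=1}^{k} \sum_{i_2=1}^{k} |V_{i_2 j} \times (V_{i_2} \setminus V_{i_2 j})| \\
&\le 5\varepsilon N^{-1} \left( \sum_{i=1}^{k} \sum_{j=1}^{k} |V_{ij} \times (V_i \setminus V_{ij})| + \sum_{j=1}^{k} \sum_{i_1 < i_2} |V_{i_1 j} \times V_{i_2 j}| \right) \\
&\le 5\varepsilon\, \mathrm{dist}(h,h') ,
\end{aligned}$$

as required. ■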

Clearly the number of label queries required for obtaining the $\varepsilon$-SRRA estimator is $O(n \max\{\varepsilon^{-2}k^3, \varepsilon^{-3}k^2\} \log n)$. Combining the theorem with this bound and the iterative algorithm described in Corollary 4 (Algorithm 1), we obtain the following:

Corollary 15 There exists an active learning algorithm for obtaining a solution $h \in \mathcal{C}$ for semi-supervised $k$-clustering with $\mathrm{er}_{\mathcal{D}}(h) \le (1 + O(\varepsilon))\nu$ with total query complexity of $O(n \max\{\varepsilon^{-2}k^3, \varepsilon^{-3}k^2\} \log^2 n)$. The algorithm succeeds with success probability at least

We do not believe the factor in the corollary is tight (see Section 7). As in the case of Corollary 10 and the ensuing discussion around LRPP, the result in Corollary 15 significantly beats known active learning results depending only on disagreement coefficient and VC dimension bounds, for small $\nu$.

6. Additional Results and Practical Considerations 

We discuss two practical extensions of our results. 

6.1 LRPP over Linearly Induced Permutations in Constant Dimensional 
Feature Space 

A special class of interest is known as LRPP over linearly induced permutations in constant dimensional feature space. We use the same definition of $\mathcal{X}$ as in Section 4, except that now each point $v \in V$ is associated with a feature vector, which we denote using bold face: $\mathbf{v} \in \mathbb{R}^d$. The concept space $\mathcal{C}$ now consists only of permutations $\pi$ such that there exists a vector $\mathbf{w} \in \mathbb{R}^d$ satisfying

$$\pi(u,v) = 1 \iff \langle \mathbf{w}, \mathbf{u} - \mathbf{v} \rangle > 0 . \tag{51}$$

We are assuming familiarity with the theory of geometric arrangements, and refer the reader to de Berg et al. (2008) for further details. Geometrically, each $(u,v) \in \mathcal{X}$ is viewed as a halfspace $H_{u,v} = \{\mathbf{x} : \langle \mathbf{x}, \mathbf{u} - \mathbf{v} \rangle > 0\}$, whose (closure) supporting hyperplane is $h_{u,v} = \{\mathbf{x} : \langle \mathbf{x}, \mathbf{u} - \mathbf{v} \rangle = 0\}$. Let $\mathcal{H}$ be the collection of these $\binom{n}{2}$ hyperplanes $\{h_{u,v} : (u,v) \in \mathcal{X}\}$.$^{12}$ The collection $\mathcal{C}$ corresponds to the maximal dimensional cells in the underlying arrangement $\mathcal{A}(\mathcal{H})$. We thus call $\mathcal{A}(\mathcal{H})$ from now on the permutation arrangement, and we naturally identify full dimensional cells with their induced permutations. We denote by $C_\pi \subseteq \mathbb{R}^d$ the unique cell corresponding to a permutation $\pi \in \mathcal{C}$.

Bounding the VC dimension and disagreement coefficient. Using standard tools from combinatorial geometry, the VC dimension of $\mathcal{C}$ is at most $d - 1$. Roughly speaking, this property follows from the fact that in an arrangement of $m$ hyperplanes in $d$-space, each of which meets the origin, the overall number of cells is at most $O(m^{d-1})$; see de Berg et al. (2008).

As for the uniform disagreement coefficient, we show below that it is bounded by $O(n)$. Let $\pi \in \mathcal{C}$ be a permutation with a corresponding cell $C_\pi$ in $\mathcal{A}(\mathcal{H})$. The ball $B(\pi,r)$ is, geometrically, the closure of the union of all cells corresponding to "realizable" permutations $\sigma$ satisfying $\mathrm{dist}(\sigma,\pi) \le r$. The corresponding disagreement region $\mathrm{DIS}(B(\pi,r))$ corresponds to the set of ordered pairs (halfspaces) intersecting $B(\pi,r)$. We next show:

Proposition 16 The measure of $\mathrm{DIS}(B(\pi,r))$ in $\mathcal{D}_{\mathcal{X}}$ is at most $8rn$.



12. Note that $h_{u,v} = h_{v,u}$.
Proof By Diaconis and Graham (1977), the Spearman Footrule distance between any two permutations $\pi$ and $\sigma$ is at most twice $N\,\mathrm{dist}(\pi,\sigma)$, where $N = n(n-1)$. Hence, if $\mathrm{dist}(\pi,\sigma)$ is $r$, then any element $u$ could only swap locations with a set of elements located up to $2rN$ positions away to the right or to the left. This yields a total of $4rN$ 'swap-candidates' for each $u$. Thus, at most $4rNn$ inversions are possible. Note that each inversion corresponds to a hyperplane (unordered pair) that we cross, and thus the total number of ordered pairs is at most $8rNn$. The probability measure of this set is at most $8rn$, because we assign equal probability of $N^{-1}$ for each possible pair in $\mathcal{X}$. The result follows. ■
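The Diaconis and Graham inequality used in the proof can be verified by brute force on small permutations. In the Python sketch below, a permutation is a tuple mapping each element to its position, $K$ is the number of inverted unordered pairs (Kendall tau) and $F$ is the Spearman Footrule distance; the inequality states $K \le F \le 2K$. Assuming, as the definition of $\mathcal{X}$ suggests, that $\mathrm{dist}$ counts ordered pairs, $N\,\mathrm{dist}(\pi,\sigma) = 2K$, so $F \le N\,\mathrm{dist}(\pi,\sigma)$ and the bound quoted in the proof holds with room to spare. The function names are illustrative.

```python
from itertools import permutations

def footrule(pi, sigma):
    """Spearman Footrule: sum over elements of |pi(u) - sigma(u)|."""
    return sum(abs(a - b) for a, b in zip(pi, sigma))

def kendall_tau(pi, sigma):
    """Number of unordered pairs {u, v} ordered differently by pi and sigma."""
    n = len(pi)
    return sum(1 for u in range(n) for v in range(u + 1, n)
               if (pi[u] - pi[v]) * (sigma[u] - sigma[v]) < 0)

def check_diaconis_graham(n):
    """Exhaustively check K <= F <= 2K over all pairs of permutations of size n."""
    perms = list(permutations(range(n)))
    for pi in perms:
        for sigma in perms:
            K = kendall_tau(pi, sigma)
            F = footrule(pi, sigma)
            if not (K <= F <= 2 * K):
                return False
    return True

print(check_diaconis_graham(5))
```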



By the proposition we have that the disagreement coefficient $\theta$ is always bounded by $O(n)$, establishing our bound. We now invoke Corollary 6 with $\mu = O(1/n^2)$ (which is tantamount to $\mu = 0$ for this problem, because $|\mathcal{X}| = O(n^2)$ and we are using the uniform measure), and conclude:

Theorem 17 An $\varepsilon$-SRRA for LRPP in linearly induced permutations in $d$ dimensional feature space can be constructed, with respect to any $\pi \in \mathcal{C}$, with probability at least $1 - \delta$, using at most $O\left(n d \varepsilon^{-2} \log^2 n + n \varepsilon^{-2} (\log n)(\log(\delta^{-1} \log n))\right)$ label queries.

Combining Theorem 17 and the iterative algorithm described in Corollary 4, we obtain:

Corollary 18 There exists an algorithm for obtaining a solution $\pi \in \mathcal{C}$ for LRPP in linearly induced permutations in $d$ dimensional feature space with $\mathrm{er}_{\mathcal{D}}(\pi) \le (1 + O(\varepsilon))\nu$ with total query complexity of

O {e^"^ ndlog^ n + n£~'^ {\og^ n) (log((5~-^ logn)) ^ (52) 

The algorithm succeeds with success probability at least $1 - \delta$.

We compare this bound to that of Corollary [TOj For the sake of comparison, assume 
5 = so that (I52p takes the simpler form of 0(e~^n(ilog^ n/ log (1/e)) . This bound is 
better than that of Corollary 1101 as long as the feature space dimension d is 0(e~^logn). 
For larger dimensions, Corollary 10 gives a better bound. It would be interesting to obtain a smoother interpolation between the geometric structure coming from the feature space and the combinatorial structure coming from permutations. We refer the reader to Jamieson and Nowak (2011) for a recent result with improved query complexity under certain Bayesian noise assumptions.



6.2 Convex Relaxations 

So far we focused on theoretical ERM aspects only. Doing so, we made no assumptions about the computability of the step $h_i = \arg\min_{h' \in \mathcal{C}} f_{i-1}(h')$ in Corollary 4 (Step 3 in Algorithm 1). Although ERM results are interesting in their own right, we take an additional step and consider convex relaxations.

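Instead of optimizing $\mathrm{er}_{\mathcal{D}}(h)$ over the set $\mathcal{C}$, assume we are interested in optimizing
$\widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h})$ over $\tilde{h} \in \tilde{\mathcal{C}}$, where $\tilde{\mathcal{C}}$ is a convex set of functions from $\mathcal{X}$ to $\mathbb{R}$. Also assume there is
a mapping $\phi : \tilde{\mathcal{C}} \to \mathcal{C}$ which is used as a "rounding" procedure. For example, in the setting
of Section 6.1 the set $\tilde{\mathcal{C}}$ consists of all vectors $w \in \mathbb{R}^d$, and the rounding method $\phi : \tilde{\mathcal{C}} \to \mathcal{C}$
converts $w$ to a permutation $\pi$ satisfying (51). When optimizing in $\tilde{\mathcal{C}}$, one conveniently
works with a convex relaxation $\widetilde{\mathrm{er}}_{\mathcal{D}} : \tilde{\mathcal{C}} \to \mathbb{R}^+$ as surrogate for the discrete loss $\mathrm{er}_{\mathcal{D}}$, defined
as follows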



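$$\widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h}) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}\left[ L(\tilde{h}(X), Y) \right], \tag{53}$$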



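where $L : \mathbb{R} \times \{0, 1\} \to \mathbb{R}^+$ is some function convex in the first argument, and satisfying

$$\mathbf{1}_{\phi(\tilde{h})(X) \ne Y} \le c\, L(\tilde{h}(X), Y)$$

for all $\tilde{h} \in \tilde{\mathcal{C}}$ and $X \in \mathcal{X}$, where $c > 0$ is some constant. In words, this means that $L$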
upper bounds the discrete loss (up to a factor of $c$). A typical choice for the example in
Section 6.1 would be to define for all $w \in \tilde{\mathcal{C}}$ and $X = (u, v) \in \mathcal{X}$: $w(X) = \langle w, u - v \rangle$, and
$L(a, b) = \max\{1 - a(2b - 1), 0\}$. Using this choice, optimizing over (53) becomes the famous
SVMRank with the hinge loss relaxation (Herbrich et al., 2000; Joachims, 2002):



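$$
\begin{aligned}
\text{Minimize}\quad & F(w, \xi) \\
\text{s.t.}\quad & \forall u, v \text{ with } Y(u, v) = 1:\ (u - v) \cdot w \ge 1 - \xi_{u,v} \\
& \forall u, v:\ \xi_{u,v} \ge 0 \\
& \|w\| \le c.
\end{aligned}
\tag{54}
$$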

(Note: c is a regularization parameter.) 
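The hinge-loss surrogate above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the rounding rule (sort items by decreasing score $\langle w, x \rangle$) is an assumed stand-in for the condition referred to as (51), and the discrete prediction is assumed to be 1 exactly when $w(X) > 0$. Under these assumptions the sketch evaluates $w(X) = \langle w, u - v \rangle$, the loss $L(a, b) = \max\{1 - a(2b - 1), 0\}$, and the empirical average of $L$ over a set of labeled pairs.

```python
def score(w, u, v):
    """w(X) = <w, u - v> for a pair X = (u, v) of feature vectors."""
    return sum(wi * (ui - vi) for wi, ui, vi in zip(w, u, v))


def hinge(a, b):
    """L(a, b) = max{1 - a(2b - 1), 0}, with label b in {0, 1}."""
    return max(1.0 - a * (2 * b - 1), 0.0)


def zero_one(a, b):
    """Discrete loss, assuming the rounded prediction is 1 iff a > 0."""
    predicted = 1 if a > 0 else 0
    return 0.0 if predicted == b else 1.0


def surrogate_risk(w, pairs, labels):
    """Average hinge loss over labeled pairs (uniform measure assumed)."""
    total = sum(hinge(score(w, u, v), y) for (u, v), y in zip(pairs, labels))
    return total / len(pairs)


def round_to_permutation(w, items):
    """Hypothetical rounding: item indices sorted by decreasing <w, x>."""
    def item_score(i):
        return sum(a * b for a, b in zip(w, items[i]))
    return sorted(range(len(items)), key=lambda i: -item_score(i))


# Toy data (made up for illustration only).
items = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
w = (1.0, 0.5)
pairs = [(items[0], items[1]), (items[1], items[2])]
labels = [1, 0]  # 1 means the first element of the pair is preferred

ranking = round_to_permutation(w, items)
risk = surrogate_risk(w, pairs, labels)
```

With the assumed sign-based prediction, the hinge loss dominates the discrete loss pointwise, so the constant $c$ in the upper-bound condition can be taken to be 1 in this sketch.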

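We now have a natural extension of relative regret: $\widetilde{\mathrm{reg}}_{\tilde{h}}(\tilde{h}') = \widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h}') - \widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h})$. By our
assumptions on convexity, $\widetilde{\mathrm{reg}}_{\tilde{h}} : \tilde{\mathcal{C}} \to \mathbb{R}$ can be efficiently optimized. We now say that
$f : \tilde{\mathcal{C}} \to \mathbb{R}$ is an $(\varepsilon, \mu)$-SRRA with respect to $\tilde{h} \in \tilde{\mathcal{C}}$ if for all $\tilde{h}' \in \tilde{\mathcal{C}}$,

$$\left| \widetilde{\mathrm{reg}}_{\tilde{h}}(\tilde{h}') - f(\tilde{h}') \right| \le \varepsilon \left( \mathrm{dist}(\phi(\tilde{h}), \phi(\tilde{h}')) + \mu \right).$$

If $\mu = 0$ then we simply say that $f$ is an $\varepsilon$-SRRA. The following is an analogue to Corollary 4: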



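Theorem 19 Let $\tilde{h}_0, \tilde{h}_1, \tilde{h}_2, \ldots$ be a sequence of hypotheses in $\tilde{\mathcal{C}}$ such that for all $i \ge 1$,
$\tilde{h}_i = \operatorname{argmin}_{\tilde{h}' \in \tilde{\mathcal{C}}} f_{i-1}(\tilde{h}')$, where $f_{i-1}$ is an $(\varepsilon, \mu)$-SRRA with respect to $\tilde{h}_{i-1}$. Then for all
$i \ge 1$,

$$\widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h}_i) = (1 + O(\varepsilon))\, \tilde{\nu} + O(\varepsilon^i)\, \widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h}_0) + O(\varepsilon \mu),$$

where $\tilde{\nu} = \inf_{\tilde{h} \in \tilde{\mathcal{C}}} \widetilde{\mathrm{er}}_{\mathcal{D}}(\tilde{h})$ and the $O$-notations may hide constants that depend on $c$.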

The proof is very similar to that of Corollary 4 and we omit the details. It turns out
that the sampling techniques used for constructing an $\varepsilon$-SRRA from Section 4.3 can be
used for constructing an $\varepsilon$-SRRA for the SVMRank relaxed version as well, as long as $\tilde{\mathcal{C}}$ is
restricted to bounded vectors $w$ and all the feature vectors corresponding to elements $v$ of the
ground set are bounded as well. It is easy to confirm that under such a bounded-norm setting all
arguments of Section 4.3 follow through. The conclusion is that we can solve SVMRank, in polynomial
time, to within an error of $(1 + \varepsilon)\tilde{\nu}$ using only $O(n\, \mathrm{poly}(\log n, \varepsilon^{-1}))$ preference queries.
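The pivot-replacement iteration of Theorem 19 (and Corollary 4) can be illustrated on a toy instance. The sketch below is only an illustration under stated assumptions: it uses a small finite class of threshold classifiers on a pool of ten points instead of a convex set, and the estimator `srra` is a hypothetical stand-in that perturbs the true relative regret by exactly $\varepsilon \cdot \mathrm{dist}$ with a deterministic sign pattern, so it satisfies the $\varepsilon$-SRRA inequality by construction. No sampling or label querying is simulated.

```python
def make_toy_problem():
    """Pool of 10 points with noisy labels; hypotheses are thresholds t,
    where h_t(x) = 1 iff x >= t (toy setup, assumed for illustration)."""
    labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
    n = len(labels)
    thresholds = list(range(n + 1))

    def err(t):
        # Fraction of pool points misclassified by threshold t.
        return sum((1 if x >= t else 0) != y for x, y in enumerate(labels)) / n

    def dist(s, t):
        # Fraction of pool points on which thresholds s and t disagree.
        return abs(s - t) / n

    return thresholds, err, dist


def srra(pivot, err, dist, eps):
    """Hypothetical eps-SRRA w.r.t. `pivot`: true relative regret plus a
    perturbation of magnitude eps * dist(pivot, t)."""
    def f(t):
        sign = 1.0 if t % 2 == 0 else -1.0
        return err(t) - err(pivot) + eps * dist(pivot, t) * sign
    return f


def iterate_pivots(h0, thresholds, err, dist, eps, rounds):
    """Repeatedly replace the pivot by the minimizer of the estimated regret."""
    pivots = [h0]
    for _ in range(rounds):
        f = srra(pivots[-1], err, dist, eps)
        pivots.append(min(thresholds, key=f))
    return pivots


thresholds, err, dist = make_toy_problem()
pivots = iterate_pivots(0, thresholds, err, dist, eps=0.1, rounds=3)
errors = [err(t) for t in pivots]
```

On this toy instance the pivots' errors drop from 0.5 at the initial hypothesis to the optimal value 0.1 within two rounds and then stay there, which matches the qualitative behavior the theorem describes: the dependence on the initial hypothesis decays geometrically.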
7. Conclusions and Future Work

In this work we showed that being able to estimate the relative regret function using carefully 
biased sampling methods can yield query efficient active learning algorithms. We showed 
that such estimations can be obtained when the only assumptions we make are bounds on 
the disagreement coefficient and the VC dimension. This leads to active learning algorithms 
that almost match the best known using the same assumptions. On the other hand, we 
presented two problems of vast interest (mostly outside but increasingly inside the active 
learning community), for which a direct analysis of the relative regret function produced 
better active learning strategies. The two problems we studied are concerned with learning 
relations over a ground set, where one problem dealt with order relations and the other with
equivalence relations (with a bounded number of equivalence classes). In both problems our
query complexity bounds contain an undesirable factor, which we believe can be
reduced using more advanced measure concentration tools. We leave this to future
work. It would also be interesting to identify other problems for which our approach yields
active learning algorithms with faster than previously known convergence rates. Immediate 
candidates are hierarchical clustering and metric learning. Finally, for LRPP, we discussed 
a practical scenario in which the ground set is endowed with feature vectors. We showed 
how to take the underlying geometry into account in our framework. We did not do so
for clustering with side information. The work of Eriksson et al. (2011) indicates that
incorporating geometric information into our analysis is a fruitful direction to pursue.

Our work made no assumptions on the noise, except maybe for its magnitude. Another
promising future research direction would be to incorporate various standard noise assumptions
known to improve active learning rates (especially the model of Mammen and Tsybakov,
1999; Tsybakov, 2004) within our setting.